A high percentage of bank profits comes from the interest on home improvement and debt consolidation loans. When borrowers default, banks can incur major losses. It is therefore crucial that banks extend loans with care and better understand which clients are more or less likely to default. Reviewing clients manually to judge creditworthiness is challenging and time-consuming, and individual biases can seep into the process. With machine learning techniques, models can be built to learn the loan approval process and help rid it of those biases.
The objective is to build a classification model (or multiple models) to predict whether or not clients are likely to default on their loan. The model should also surface important features and recommendations for the bank to consider.
A bank's consumer credit department aims to streamline the decision-making process for approving home equity lines of credit. To do this, it will follow the Equal Credit Opportunity Act's guidelines and establish an empirically derived, statistically sound credit-scoring model. The model will be based on data collected through the existing loan underwriting process from recent applicants who were granted credit, and it will be built with predictive modeling techniques. Crucially, the model must remain interpretable enough to justify any adverse decision (rejection).
The Home Equity dataset (HMEQ) contains baseline and loan performance information for recent home equity loans. The target (BAD) is a binary variable that indicates whether an applicant has ultimately defaulted or has been severely delinquent. There are 12 input variables registered for each applicant.
# Install imblearn library
!pip install imblearn
Requirement already satisfied: imblearn in c:\users\mwwol\anaconda3\lib\site-packages (0.0)
Requirement already satisfied: imbalanced-learn in c:\users\mwwol\anaconda3\lib\site-packages (from imblearn) (0.9.1)
Requirement already satisfied: numpy>=1.17.3 in c:\users\mwwol\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (1.21.5)
Requirement already satisfied: joblib>=1.0.0 in c:\users\mwwol\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (1.1.0)
Requirement already satisfied: scikit-learn>=1.1.0 in c:\users\mwwol\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (1.1.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\mwwol\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (2.2.0)
Requirement already satisfied: scipy>=1.3.2 in c:\users\mwwol\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (1.7.3)
# Ignore warnings
import warnings
warnings.filterwarnings("ignore")
# Libraries for data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Algorithms to use
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
# Metrics to evaluate the model
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score, precision_recall_curve
from sklearn import metrics
import scipy.stats as stats
# For hyperparameter tuning and SMOTE analysis
from sklearn.model_selection import GridSearchCV
from imblearn.over_sampling import SMOTE
# Load and read dataset
loans = pd.read_csv("hmeq.csv")
# Copy data to avoid changes to original dataset
df = loans.copy()
# View first entries of dataset
df.head()
| | BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1100 | 25860.0 | 39025.0 | HomeImp | Other | 10.5 | 0.0 | 0.0 | 94.366667 | 1.0 | 9.0 | NaN |
| 1 | 1 | 1300 | 70053.0 | 68400.0 | HomeImp | Other | 7.0 | 0.0 | 2.0 | 121.833333 | 0.0 | 14.0 | NaN |
| 2 | 1 | 1500 | 13500.0 | 16700.0 | HomeImp | Other | 4.0 | 0.0 | 0.0 | 149.466667 | 1.0 | 10.0 | NaN |
| 3 | 1 | 1500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 0 | 1700 | 97800.0 | 112000.0 | HomeImp | Office | 3.0 | 0.0 | 0.0 | 93.333333 | 0.0 | 14.0 | NaN |
# View last entries of the dataset
df.tail()
| | BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5955 | 0 | 88900 | 57264.0 | 90185.0 | DebtCon | Other | 16.0 | 0.0 | 0.0 | 221.808718 | 0.0 | 16.0 | 36.112347 |
| 5956 | 0 | 89000 | 54576.0 | 92937.0 | DebtCon | Other | 16.0 | 0.0 | 0.0 | 208.692070 | 0.0 | 15.0 | 35.859971 |
| 5957 | 0 | 89200 | 54045.0 | 92924.0 | DebtCon | Other | 15.0 | 0.0 | 0.0 | 212.279697 | 0.0 | 15.0 | 35.556590 |
| 5958 | 0 | 89800 | 50370.0 | 91861.0 | DebtCon | Other | 14.0 | 0.0 | 0.0 | 213.892709 | 0.0 | 16.0 | 34.340882 |
| 5959 | 0 | 89900 | 48811.0 | 88934.0 | DebtCon | Other | 15.0 | 0.0 | 0.0 | 219.601002 | 0.0 | 16.0 | 34.571519 |
# Look at datatypes and number of null values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   BAD      5960 non-null   int64
 1   LOAN     5960 non-null   int64
 2   MORTDUE  5442 non-null   float64
 3   VALUE    5848 non-null   float64
 4   REASON   5708 non-null   object
 5   JOB      5681 non-null   object
 6   YOJ      5445 non-null   float64
 7   DEROG    5252 non-null   float64
 8   DELINQ   5380 non-null   float64
 9   CLAGE    5652 non-null   float64
 10  NINQ     5450 non-null   float64
 11  CLNO     5738 non-null   float64
 12  DEBTINC  4693 non-null   float64
dtypes: float64(9), int64(2), object(2)
memory usage: 605.4+ KB
Observations:
df.isnull().sum()
BAD           0
LOAN          0
MORTDUE     518
VALUE       112
REASON      252
JOB         279
YOJ         515
DEROG       708
DELINQ      580
CLAGE       308
NINQ        510
CLNO        222
DEBTINC    1267
dtype: int64
(df.isnull().sum() / df.shape[0] * 100)
BAD         0.000000
LOAN        0.000000
MORTDUE     8.691275
VALUE       1.879195
REASON      4.228188
JOB         4.681208
YOJ         8.640940
DEROG      11.879195
DELINQ      9.731544
CLAGE       5.167785
NINQ        8.557047
CLNO        3.724832
DEBTINC    21.258389
dtype: float64
Observations:
# Create a list of the categorical (object-dtype) variables
cols = df.select_dtypes(['object']).columns.tolist()
cols
['REASON', 'JOB']
# Convert object dtypes to category to reduce memory usage
for i in cols:
df[i] = df[i].astype('category')
# Checking conversion
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   BAD      5960 non-null   int64
 1   LOAN     5960 non-null   int64
 2   MORTDUE  5442 non-null   float64
 3   VALUE    5848 non-null   float64
 4   REASON   5708 non-null   category
 5   JOB      5681 non-null   category
 6   YOJ      5445 non-null   float64
 7   DEROG    5252 non-null   float64
 8   DELINQ   5380 non-null   float64
 9   CLAGE    5652 non-null   float64
 10  NINQ     5450 non-null   float64
 11  CLNO     5738 non-null   float64
 12  DEBTINC  4693 non-null   float64
dtypes: category(2), float64(9), int64(2)
memory usage: 524.3 KB
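The memory savings from the category conversion can be checked directly. A small sketch on toy data (not the HMEQ columns), using `Series.memory_usage(deep=True)` so the actual string storage is counted:

```python
import pandas as pd

# A toy column with many repeated string labels, similar to REASON/JOB
s_obj = pd.Series(["DebtCon", "HomeImp"] * 3000, dtype="object")
s_cat = s_obj.astype("category")

# deep=True counts the underlying string storage, not just pointers
mem_obj = s_obj.memory_usage(deep=True)
mem_cat = s_cat.memory_usage(deep=True)
print(mem_obj, mem_cat)  # the category version should be much smaller
```

Category dtype stores each distinct label once plus small integer codes, which is why it pays off when a column has few unique values relative to its length.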
# Function for visualizing categorical variables
def perc_on_bar(feature, titl):
plt.figure(figsize = (8,5))
ax = sns.countplot(data = df, x = feature)
for p in ax.patches:
txt = np.round((p.get_height() / len(feature) * 100), 1)
annot = txt.astype('str')
x = p.get_x() + p.get_width() / 2 - 0.05
y = p.get_y() + p.get_height()
ax.annotate(annot + '%', (x,y))
plt.title(titl)
plt.show()
# Plot percentages of reasons for seeking loans in dataset
reas_title = "Reasons for Loan Requests"
perc_on_bar(df["REASON"], reas_title)
# Plot percentage of different jobs in dataset
job_title = "Different Jobs in Dataset"
perc_on_bar(df["JOB"], job_title)
Observations:
# Plot percentages of loan defaults and repayments in dataset
bad_title = "Defaults (1) or loan repayments (0)"
perc_on_bar(df["BAD"], bad_title)
Observations:
# Compare the loan amounts for home improvement vs. debt consolidation
plt.figure(figsize = (8,5))
sns.boxplot(data = df, x = 'REASON', y = 'LOAN', showmeans = True)
plt.show()
Observations:
pd.pivot_table(data = df, index = "REASON", values ='BAD', aggfunc = ['count', np.mean, np.std]).T
| | | DebtCon | HomeImp |
|---|---|---|---|
| count | BAD | 3928.000000 | 1780.000000 |
| mean | BAD | 0.189664 | 0.222472 |
| std | BAD | 0.392085 | 0.416023 |
Observations:
# Compare loan amounts across different types of jobs
plt.figure(figsize = (10,6))
sns.boxplot(data = df, x = 'JOB', y = 'LOAN', showmeans = True)
plt.show()
Observations:
# Create a pivot table to compare default rates for different types of jobs
pd.pivot_table(df, index = "JOB", values = 'BAD', aggfunc = ['count', np.mean, np.std]).T
| | | Mgr | Office | Other | ProfExe | Sales | Self |
|---|---|---|---|---|---|---|---|
| count | BAD | 767.000000 | 948.000000 | 2388.000000 | 1276.000000 | 109.000000 | 193.000000 |
| mean | BAD | 0.233377 | 0.131857 | 0.231993 | 0.166144 | 0.348624 | 0.300518 |
| std | BAD | 0.423256 | 0.338513 | 0.422193 | 0.372356 | 0.478736 | 0.459676 |
Observations:
# Create function for univariate analysis on numerical values in dataset
def num_uni(feature, feat_name):
print(feat_name)
print('Skew :',round(feature.skew(), 2))
plt.figure(figsize = (15, 4))
plt.subplot(1, 2, 1)
feature.hist(bins = 10, grid = False)
plt.axvline(feature.mean(), color = "green", linestyle = "--")
plt.axvline(feature.median(), color = "black", linestyle = "--")
plt.ylabel('count')
plt.subplot(1, 2, 2)
sns.boxplot(x = feature, showmeans = True)
plt.show()
loan_name = "LOAN = Amount of Loan"
num_uni(df["LOAN"], loan_name)
LOAN = Amount of Loan Skew : 2.02
Observations:
mortdue_name = "MORTDUE = Mortgage Still Due"
num_uni(df["MORTDUE"], mortdue_name)
MORTDUE = Mortgage Still Due Skew : 1.81
Observations:
val_name = "VALUE = Value of Property"
num_uni(df["VALUE"], val_name)
VALUE = Value of Property Skew : 3.05
Observations:
yoj_name = "YOJ = Years at Present Job"
num_uni(df["YOJ"], yoj_name)
YOJ = Years at Present Job Skew : 0.99
Observations:
derog_name = "DEROG = Number of Derogatory Reports"
num_uni(df["DEROG"], derog_name)
DEROG = Number of Derogatory Reports Skew : 5.32
delinq_name = "DELINQ = Number of Delinquent Credit Lines"
num_uni(df["DELINQ"], delinq_name)
DELINQ = Number of Delinquent Credit Lines Skew : 4.02
Observations:
clage_name = "CLAGE = Age of Oldest Credit Line"
num_uni(df["CLAGE"], clage_name)
CLAGE = Age of Oldest Credit Line Skew : 1.34
Observations:
ninq_name = "NINQ = Number of Recent Credit Inquiries"
num_uni(df["NINQ"], ninq_name)
NINQ = Number of Recent Credit Inquiries Skew : 2.62
Observations:
clno_name = "CLNO = Number of Existing Credit Lines"
num_uni(df["CLNO"], clno_name)
CLNO = Number of Existing Credit Lines Skew : 0.78
Observations:
debtinc_name = "DEBTINC = Debt-to-Income Ratio"
num_uni(df["DEBTINC"], debtinc_name)
DEBTINC = Debt-to-Income Ratio Skew : 2.85
Observations:
# Visualize the loan outcomes vs. different numerical values
fig, axes = plt.subplots(5, 2, figsize = (16, 42))
fig.suptitle('Boxplots for all numerical variables vs. loan defaults', size = 20)
sns.boxplot(x = 'BAD', y = 'LOAN', data = df, ax = axes[0, 0]);
sns.boxplot(x = 'BAD', y = 'MORTDUE', data = df, ax = axes[0, 1]);
sns.boxplot(x = 'BAD', y = 'VALUE', data = df, ax = axes[1, 0]);
sns.boxplot(x = 'BAD', y = 'YOJ', data = df, ax = axes[1, 1]);
sns.boxplot(x = 'BAD', y = 'DEROG', data = df, ax = axes[2, 0]);
sns.boxplot(x = 'BAD', y = 'DELINQ', data = df, ax = axes[2, 1]);
sns.boxplot(x = 'BAD', y = 'CLAGE', data = df, ax = axes[3, 0]);
sns.boxplot(x = 'BAD', y = 'NINQ', data = df, ax = axes[3, 1]);
sns.boxplot(x = 'BAD', y = 'CLNO', data = df, ax = axes[4, 0]);
sns.boxplot(x = 'BAD', y = 'DEBTINC', data = df, ax = axes[4, 1])
fig.tight_layout()
fig.subplots_adjust(top=0.95)
Observations:
# Visualize reason for loans vs. different numerical values
fig, axes = plt.subplots(5, 2, figsize = (16, 42))
fig.suptitle('Boxplots for all numerical variables vs. reason for loans', size = 20)
sns.boxplot(x = 'REASON', y = 'LOAN', data = df, ax = axes[0, 0], palette = "Set2");
sns.boxplot(x = 'REASON', y = 'MORTDUE', data = df, ax = axes[0, 1], palette = "Set2");
sns.boxplot(x = 'REASON', y = 'VALUE', data = df, ax = axes[1, 0], palette = "Set2");
sns.boxplot(x = 'REASON', y = 'YOJ', data = df, ax = axes[1, 1], palette = "Set2");
sns.boxplot(x = 'REASON', y = 'DEROG', data = df, ax = axes[2, 0], palette = "Set2");
sns.boxplot(x = 'REASON', y = 'DELINQ', data = df, ax = axes[2, 1], palette = "Set2");
sns.boxplot(x = 'REASON', y = 'CLAGE', data = df, ax = axes[3, 0], palette = "Set2");
sns.boxplot(x = 'REASON', y = 'NINQ', data = df, ax = axes[3, 1], palette = "Set2");
sns.boxplot(x = 'REASON', y = 'CLNO', data = df, ax = axes[4, 0], palette = "Set2");
sns.boxplot(x = 'REASON', y = 'DEBTINC', data = df, ax = axes[4, 1], palette = "Set2")
fig.tight_layout()
fig.subplots_adjust(top=0.95)
Observations:
# Visualize client jobs vs. different numerical values
fig, axes = plt.subplots(10, 1, figsize = (16, 60))
fig.suptitle('Boxplots for all numerical variables vs. client jobs', size = 20)
sns.boxplot(x = 'JOB', y = 'LOAN', data = df, ax = axes[0], palette = "Set2");
sns.boxplot(x = 'JOB', y = 'MORTDUE', data = df, ax = axes[1], palette = "Set2");
sns.boxplot(x = 'JOB', y = 'VALUE', data = df, ax = axes[2], palette = "Set2");
sns.boxplot(x = 'JOB', y = 'YOJ', data = df, ax = axes[3], palette = "Set2");
sns.boxplot(x = 'JOB', y = 'DEROG', data = df, ax = axes[4], palette = "Set2");
sns.boxplot(x = 'JOB', y = 'DELINQ', data = df, ax = axes[5], palette = "Set2");
sns.boxplot(x = 'JOB', y = 'CLAGE', data = df, ax = axes[6], palette = "Set2");
sns.boxplot(x = 'JOB', y = 'NINQ', data = df, ax = axes[7], palette = "Set2");
sns.boxplot(x = 'JOB', y = 'CLNO', data = df, ax = axes[8], palette = "Set2");
sns.boxplot(x = 'JOB', y = 'DEBTINC', data = df, ax = axes[9], palette = "Set2")
fig.tight_layout()
fig.subplots_adjust(top=0.95)
Observations:
# Create pairplot to look at correlations between numerical variables
sns.pairplot(data = df.iloc[:, 1:], kind = 'scatter', corner = True, dropna = True)
plt.show()
Observations:
# Plot a heatmap to look at possible correlations between numerical variables
numerical_col = df.select_dtypes(include=np.number).columns.tolist()
corr = df[numerical_col].corr()
plt.figure(figsize=(16,12))
sns.heatmap(corr,cmap='coolwarm', annot = True, vmax=1,vmin=-1,
fmt=".2f",
xticklabels=corr.columns,
yticklabels=corr.columns);
Observations:
# Change dependent outcome variable to categorical
df['BAD'] = df['BAD'].astype('category')
df_treat1: Missing numerical values will be imputed with the median and missing categorical values with the mode; outliers will be clipped to the whiskers of the variable boxplot.
df_treat2: Missing values will be imputed as in df_treat1, but outliers will be left untreated.
df_treat3: Missing values will be imputed as in df_treat1; outliers will be clipped for all numerical variables except DEROG and DELINQ.
df_treat4: Same as df_treat3, but DEROG and DELINQ will additionally be converted to binary variables.
df_treat5: All entries with missing values will be removed, and the values of outliers will be changed to the lower or upper whisker of the variable boxplot.
These are not all of the different data treatments that can be done, but they will allow for model comparison between different types of data treatment.
# Copy modified dataframe for different treatments
df_treat1 = df.copy()
df_treat2 = df.copy()
df_treat3 = df.copy()
df_treat4 = df.copy()
df_treat5 = df.copy()
df_treat5.head()
| BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1100 | 25860.0 | 39025.0 | HomeImp | Other | 10.5 | 0.0 | 0.0 | 94.366667 | 1.0 | 9.0 | NaN |
| 1 | 1 | 1300 | 70053.0 | 68400.0 | HomeImp | Other | 7.0 | 0.0 | 2.0 | 121.833333 | 0.0 | 14.0 | NaN |
| 2 | 1 | 1500 | 13500.0 | 16700.0 | HomeImp | Other | 4.0 | 0.0 | 0.0 | 149.466667 | 1.0 | 10.0 | NaN |
| 3 | 1 | 1500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 0 | 1700 | 97800.0 | 112000.0 | HomeImp | Office | 3.0 | 0.0 | 0.0 | 93.333333 | 0.0 | 14.0 | NaN |
# Treat missing values in numerical columns with the median, and in categorical columns with the mode, for df_treat1-4
# Replace Numerical missing values for df_treat1
for col in df_treat1.select_dtypes('number').iloc[:, 1:]:
df_treat1[col].fillna(value = df_treat1[col].median(), inplace = True)
# Replace categorical missing values for df_treat1
df_treat1["REASON"].fillna(value = "DebtCon", inplace = True)
df_treat1["JOB"].fillna(value = "Other", inplace = True)
# df_treat2
for col in df_treat2.select_dtypes('number').iloc[:, 1:]:
df_treat2[col].fillna(value = df_treat2[col].median(), inplace = True)
df_treat2["REASON"].fillna(value = "DebtCon", inplace = True)
df_treat2["JOB"].fillna(value = "Other", inplace = True)
# df_treat3
for col in df_treat3.select_dtypes('number').iloc[:, 1:]:
df_treat3[col].fillna(value = df_treat3[col].median(), inplace = True)
df_treat3["REASON"].fillna(value = "DebtCon", inplace = True)
df_treat3["JOB"].fillna(value = "Other", inplace = True)
# df_treat4
for col in df_treat4.select_dtypes('number').iloc[:, 1:]:
df_treat4[col].fillna(value = df_treat4[col].median(), inplace = True)
df_treat4["REASON"].fillna(value = "DebtCon", inplace = True)
df_treat4["JOB"].fillna(value = "Other", inplace = True)
# Check if there are any null values remaining.
df_treat4.isna().sum()
BAD        0
LOAN       0
MORTDUE    0
VALUE      0
REASON     0
JOB        0
YOJ        0
DEROG      0
DELINQ     0
CLAGE      0
NINQ       0
CLNO       0
DEBTINC    0
dtype: int64
Observations:
# Remove missing values for df_treat5
df_treat5 = df_treat5.dropna()
# Check for any missing values
df_treat5.isna().sum()
BAD        0
LOAN       0
MORTDUE    0
VALUE      0
REASON     0
JOB        0
YOJ        0
DEROG      0
DELINQ     0
CLAGE      0
NINQ       0
CLNO       0
DEBTINC    0
dtype: int64
# Check shape of df_treat5
df_treat5.shape
(3364, 13)
# Set outliers to upper or lower whisker bounds of boxplots for df_treat1 and df_treat5
numer_cols = df_treat1.select_dtypes("number")
for col in numer_cols:
Q1 = np.percentile(df_treat1[col], 25)
Q3 = np.percentile(df_treat1[col], 75)
IQR = Q3 - Q1
Low_whis = Q1 - 1.5*IQR
High_whis = Q3 + 1.5*IQR
df_treat1[col] = np.clip(df_treat1[col], Low_whis, High_whis)
numer_cols5 = df_treat5.select_dtypes("number")
for col in numer_cols5:
Q1 = np.percentile(df_treat5[col], 25)
Q3 = np.percentile(df_treat5[col], 75)
IQR = Q3 - Q1
Low_whis = Q1 - 1.5*IQR
High_whis = Q3 + 1.5*IQR
df_treat5[col] = np.clip(df_treat5[col], Low_whis, High_whis)
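The same whisker-clipping logic is repeated for several treatments, so it could be factored into a small reusable helper. A sketch (the `clip_iqr` name and `k` parameter are not from the original notebook):

```python
import numpy as np
import pandas as pd

def clip_iqr(frame, cols, k=1.5):
    """Clip each column in `cols` to [Q1 - k*IQR, Q3 + k*IQR] in place."""
    for col in cols:
        q1, q3 = np.percentile(frame[col], [25, 75])
        iqr = q3 - q1
        frame[col] = np.clip(frame[col], q1 - k * iqr, q3 + k * iqr)
    return frame

# Example: 100 is far outside the whiskers and gets pulled in
demo = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]})
clip_iqr(demo, ["x"])
print(demo["x"].max())  # 14.5, the upper whisker Q3 + 1.5*IQR
```

Each treatment could then call `clip_iqr(df_treatX, cols)` with its own column list, keeping the whisker formula in one place.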
# Look at a couple of examples of treated numerical variables
mortdue_name = "MORTDUE = Mortgage Still Due"
num_uni(df_treat1["MORTDUE"], mortdue_name)
clage_name = "CLAGE = Age of Oldest Credit Line"
num_uni(df_treat1["CLAGE"], clage_name)
mortdue_name = "MORTDUE = Mortgage Still Due"
num_uni(df_treat5["MORTDUE"], mortdue_name)
MORTDUE = Mortgage Still Due Skew : 0.62
CLAGE = Age of Oldest Credit Line Skew : 0.48
MORTDUE = Mortgage Still Due Skew : 0.67
Observations:
# Set outliers to upper or lower whisker bounds of boxplots for df_treat3, except for DEROG and DELINQ variables
numer_cols3 = ["LOAN", "MORTDUE", "VALUE", "YOJ", "CLAGE", "NINQ", "CLNO", "DEBTINC"]
for col in numer_cols3:
Q1 = np.percentile(df_treat3[col], 25)
Q3 = np.percentile(df_treat3[col], 75)
IQR = Q3 - Q1
Low_whis = Q1 - 1.5*IQR
High_whis = Q3 + 1.5*IQR
df_treat3[col] = np.clip(df_treat3[col], Low_whis, High_whis)
# Look at a couple of examples of treated numerical variables
mortdue_name = "MORTDUE = Mortgage Still Due"
num_uni(df_treat3["MORTDUE"], mortdue_name)
debtinc_name = "DEBTINC = debt-to-income ratio"
num_uni(df_treat3["DEBTINC"], debtinc_name)
derog_name = "DEROG = Number of Derogatory reports"
num_uni(df_treat3["DEROG"], derog_name)
MORTDUE = Mortgage Still Due Skew : 0.62
DEBTINC = debt-to-income ratio Skew : -0.53
DEROG = Number of Derogatory reports Skew : 5.69
# Set outliers to upper or lower whisker bounds of boxplots for df_treat4, except for DEROG and DELINQ variables
numer_cols4 = ["LOAN", "MORTDUE", "VALUE", "YOJ", "CLAGE", "NINQ", "CLNO", "DEBTINC"]
for col in numer_cols4:
Q1 = np.percentile(df_treat4[col], 25)
Q3 = np.percentile(df_treat4[col], 75)
IQR = Q3 - Q1
Low_whis = Q1 - 1.5*IQR
High_whis = Q3 + 1.5*IQR
df_treat4[col] = np.clip(df_treat4[col], Low_whis, High_whis)
# Making DEROG and DELINQ binary variables
df_treat4.loc[df_treat4['DEROG'] >= 1, 'DEROG'] = 1
df_treat4.loc[df_treat4['DELINQ'] >= 1, 'DELINQ'] = 1
Techniques to explore:
Overall Solution Design:
Measures of Success:
# Set 'BAD' (loan default) as the dependent variable
X1 = df_treat1.drop(columns = 'BAD')
Y1 = df_treat1['BAD']
X2 = df_treat2.drop(columns = 'BAD')
Y2 = df_treat2['BAD']
X3 = df_treat3.drop(columns = 'BAD')
Y3 = df_treat3['BAD']
X4 = df_treat4.drop(columns = 'BAD')
Y4 = df_treat4['BAD']
X5 = df_treat5.drop(columns = 'BAD')
Y5 = df_treat5['BAD']
# Create dummy values for categorical variables
X1 = pd.get_dummies(X1, prefix = ['REASON', 'JOB'], columns = ['REASON', 'JOB'])
X2 = pd.get_dummies(X2, prefix = ['REASON', 'JOB'], columns = ['REASON', 'JOB'])
X3 = pd.get_dummies(X3, prefix = ['REASON', 'JOB'], columns = ['REASON', 'JOB'])
X4 = pd.get_dummies(X4, prefix = ['REASON', 'JOB'], columns = ['REASON', 'JOB'])
X5 = pd.get_dummies(X5, prefix = ['REASON', 'JOB'], columns = ['REASON', 'JOB'])
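`pd.get_dummies` expands each categorical column into one indicator column per level. A toy illustration (not HMEQ data) of the column names it produces, and of the `drop_first` option:

```python
import pandas as pd

toy = pd.DataFrame({"REASON": ["DebtCon", "HomeImp", "DebtCon"]})

dummies = pd.get_dummies(toy, prefix=["REASON"], columns=["REASON"])
print(list(dummies.columns))  # ['REASON_DebtCon', 'REASON_HomeImp']

# drop_first=True removes one redundant level, which can help linear
# models such as logistic regression avoid multicollinearity
dummies_df = pd.get_dummies(toy, columns=["REASON"], drop_first=True)
print(list(dummies_df.columns))  # ['REASON_HomeImp']
```

Tree-based models are insensitive to the redundant level, so keeping all dummies (as above) is harmless for the decision trees, but `drop_first=True` is worth considering for the logistic regression and LDA models.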
# Splitting data in to training and testing sets, 70/30
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, Y1, test_size = 0.30, random_state = 1)
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, Y2, test_size = 0.30, random_state = 1)
X3_train, X3_test, y3_train, y3_test = train_test_split(X3, Y3, test_size = 0.30, random_state = 1)
X4_train, X4_test, y4_train, y4_test = train_test_split(X4, Y4, test_size = 0.30, random_state = 1)
X5_train, X5_test, y5_train, y5_test = train_test_split(X5, Y5, test_size = 0.30, random_state = 1)
# Checking Splitting
print("Shape of the training set: ", X5_train.shape)
print("Shape of the test set: ", X5_test.shape)
print("Percentage of classes in the training set:")
print(y5_train.value_counts(normalize = True))
print("Percentage of classes in the test set:")
print(y5_test.value_counts(normalize = True))
Shape of the training set:  (2354, 18)
Shape of the test set:  (1010, 18)
Percentage of classes in the training set:
0    0.913764
1    0.086236
Name: BAD, dtype: float64
Percentage of classes in the test set:
0    0.90396
1    0.09604
Name: BAD, dtype: float64
Observations:
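Because BAD is imbalanced, the class percentages in the training and test splits can drift apart by chance; `train_test_split` accepts a `stratify` argument that preserves the class ratio exactly in both splits. A sketch on synthetic labels (not part of the original notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)  # 80/20 imbalance, like BAD

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)
# The 20% minority share is preserved exactly in both splits
print((y_tr == 1).mean(), (y_te == 1).mean())  # 0.2 0.2
```

Passing `stratify=Y1` (and so on for the other treatments) would remove the small train/test class-ratio difference seen above.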
# Function to print the classification report and get confusion matrix in a proper format
def metrics_score(actual, predicted):
print(classification_report(actual, predicted))
cm = confusion_matrix(actual, predicted)
plt.figure(figsize = (8, 5))
sns.heatmap(cm, annot = True, fmt = '.2f', xticklabels = ['Repaid', 'Defaulted'], yticklabels = ['Repaid', 'Defaulted'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
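For reference, sklearn's `confusion_matrix` puts actual classes on the rows and predicted classes on the columns, which is how the heatmap in `metrics_score` is labeled. A minimal check:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1]  # actual:    two repaid, two defaulted
y_pred = [0, 1, 1, 1]  # predicted
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[1 1]    row 0: actual repaid  -> 1 correct, 1 false alarm
#  [0 2]]   row 1: actual default -> 0 missed, 2 caught
```

Recall for the default class is therefore read off the bottom row, which is the metric the bank cares most about here (missed defaulters are the costly errors).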
Data Treatment 1 Decision Tree
# Fitting decision tree classifier on data with class weights
d_tree1 = DecisionTreeClassifier(random_state = 7, class_weight = {0: 0.2, 1: 0.8})
d_tree1.fit(X1_train, y1_train)
DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8}, random_state=7)
# Checking fit on training data
y_pred_train1 = d_tree1.predict(X1_train)
metrics_score(y1_train, y_pred_train1)
precision recall f1-score support
0 1.00 1.00 1.00 3355
1 1.00 1.00 1.00 817
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
Observations:
# Checking fit on testing data
y_pred_test1 = d_tree1.predict(X1_test)
metrics_score(y1_test, y_pred_test1)
precision recall f1-score support
0 0.90 0.93 0.91 1416
1 0.68 0.59 0.63 372
accuracy 0.86 1788
macro avg 0.79 0.76 0.77 1788
weighted avg 0.85 0.86 0.85 1788
Observations:
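The `class_weight={0: 0.2, 1: 0.8}` argument used above makes a misclassified default four times as costly as a misclassified repayment during tree induction, nudging splits toward catching minority cases. A minimal sketch of the API on synthetic data (not HMEQ):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = np.array([0] * 180 + [1] * 20)  # imbalanced toy labels

# Weighted tree: minority errors cost 0.8 vs 0.2 for majority errors
weighted = DecisionTreeClassifier(random_state=7,
                                  class_weight={0: 0.2, 1: 0.8})
weighted.fit(X, y)
print(weighted.classes_)  # [0 1]
```

`class_weight="balanced"` is an alternative that sets the weights automatically in inverse proportion to the class frequencies, avoiding a hand-picked 0.2/0.8 split.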
Data Treatment 2 Decision Tree
# Fitting decision tree classifier on data with class weights
d_tree2 = DecisionTreeClassifier(random_state = 7, class_weight = {0: 0.2, 1: 0.8})
d_tree2.fit(X2_train, y2_train)
DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8}, random_state=7)
# Checking fit on training data
y_pred_train2 = d_tree2.predict(X2_train)
metrics_score(y2_train, y_pred_train2)
precision recall f1-score support
0 1.00 1.00 1.00 3355
1 1.00 1.00 1.00 817
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
Observations:
# Checking fit on testing data
y_pred_test2 = d_tree2.predict(X2_test)
metrics_score(y2_test, y_pred_test2)
precision recall f1-score support
0 0.90 0.93 0.92 1416
1 0.69 0.62 0.65 372
accuracy 0.86 1788
macro avg 0.80 0.77 0.78 1788
weighted avg 0.86 0.86 0.86 1788
Observations:
Data Treatment 3 Decision Tree
# Fitting decision tree classifier on data with class weights
d_tree3 = DecisionTreeClassifier(random_state = 7, class_weight = {0: 0.2, 1: 0.8})
d_tree3.fit(X3_train, y3_train)
DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8}, random_state=7)
# Checking fit on training data
y_pred_train3 = d_tree3.predict(X3_train)
metrics_score(y3_train, y_pred_train3)
precision recall f1-score support
0 1.00 1.00 1.00 3355
1 1.00 1.00 1.00 817
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
Observations:
# Checking fit on testing data
y_pred_test3 = d_tree3.predict(X3_test)
metrics_score(y3_test, y_pred_test3)
precision recall f1-score support
0 0.90 0.93 0.91 1416
1 0.70 0.60 0.64 372
accuracy 0.86 1788
macro avg 0.80 0.76 0.78 1788
weighted avg 0.86 0.86 0.86 1788
Observations:
Data Treatment 4 Decision Tree
# Fitting decision tree classifier on data with class weights
d_tree4 = DecisionTreeClassifier(random_state = 7, class_weight = {0: 0.2, 1: 0.8})
d_tree4.fit(X4_train, y4_train)
DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8}, random_state=7)
# Checking fit on training data
y_pred_train4 = d_tree4.predict(X4_train)
metrics_score(y4_train, y_pred_train4)
precision recall f1-score support
0 1.00 1.00 1.00 3355
1 1.00 1.00 1.00 817
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
Observations:
# Checking fit on testing data
y_pred_test4 = d_tree4.predict(X4_test)
metrics_score(y4_test, y_pred_test4)
precision recall f1-score support
0 0.89 0.93 0.91 1416
1 0.69 0.58 0.63 372
accuracy 0.86 1788
macro avg 0.79 0.75 0.77 1788
weighted avg 0.85 0.86 0.85 1788
Observations:
Data Treatment 5 Decision Tree
# Fitting decision tree classifier on data with class weights
d_tree5 = DecisionTreeClassifier(random_state = 7, class_weight = {0: 0.1, 1: 0.9})
d_tree5.fit(X5_train, y5_train)
DecisionTreeClassifier(class_weight={0: 0.1, 1: 0.9}, random_state=7)
# Checking fit on training data
y_pred_train5 = d_tree5.predict(X5_train)
metrics_score(y5_train, y_pred_train5)
precision recall f1-score support
0 1.00 1.00 1.00 2151
1 1.00 1.00 1.00 203
accuracy 1.00 2354
macro avg 1.00 1.00 1.00 2354
weighted avg 1.00 1.00 1.00 2354
Observations:
# Checking fit on testing data
y_pred_test5 = d_tree5.predict(X5_test)
metrics_score(y5_test, y_pred_test5)
precision recall f1-score support
0 0.94 0.98 0.96 913
1 0.71 0.38 0.50 97
accuracy 0.93 1010
macro avg 0.82 0.68 0.73 1010
weighted avg 0.92 0.93 0.92 1010
Observations:
# Fitting decision tree classifier on data without class weights
d_tree2_nc = DecisionTreeClassifier(random_state = 7)
d_tree2_nc.fit(X2_train, y2_train)
DecisionTreeClassifier(random_state=7)
# Checking fit on training data
y_pred_train2_nc = d_tree2_nc.predict(X2_train)
metrics_score(y2_train, y_pred_train2_nc)
precision recall f1-score support
0 1.00 1.00 1.00 3355
1 1.00 1.00 1.00 817
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
# Checking fit on testing data
y_pred_test2_nc = d_tree2_nc.predict(X2_test)
metrics_score(y2_test, y_pred_test2_nc)
precision recall f1-score support
0 0.90 0.93 0.91 1416
1 0.70 0.59 0.64 372
accuracy 0.86 1788
macro avg 0.80 0.76 0.78 1788
weighted avg 0.86 0.86 0.86 1788
Observations:
# Split training and testing data into 75/25
X2_train25, X2_test25, y2_train25, y2_test25 = train_test_split(X2, Y2, test_size = 0.25, random_state = 1)
# Split training and testing data into 80/20
X2_train20, X2_test20, y2_train20, y2_test20 = train_test_split(X2, Y2, test_size = 0.20, random_state = 1)
# Fitting decision tree classifier on data with class weights for 75/25 split
d_tree2_25 = DecisionTreeClassifier(random_state = 7, class_weight = {0: 0.2, 1: 0.8})
d_tree2_25.fit(X2_train25, y2_train25)
DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8}, random_state=7)
# Checking fit on training data for 75/25 split
y_pred_train2_25 = d_tree2_25.predict(X2_train25)
metrics_score(y2_train25, y_pred_train2_25)
precision recall f1-score support
0 1.00 1.00 1.00 3600
1 1.00 1.00 1.00 870
accuracy 1.00 4470
macro avg 1.00 1.00 1.00 4470
weighted avg 1.00 1.00 1.00 4470
# Checking fit on testing data for 75/25 split
y_pred_test2_25 = d_tree2_25.predict(X2_test25)
metrics_score(y2_test25, y_pred_test2_25)
precision recall f1-score support
0 0.90 0.93 0.91 1171
1 0.71 0.61 0.66 319
accuracy 0.86 1490
macro avg 0.80 0.77 0.79 1490
weighted avg 0.86 0.86 0.86 1490
Observations:
# Fitting decision tree classifier on data with class weights for 80/20 split
d_tree2_20 = DecisionTreeClassifier(random_state = 7, class_weight = {0: 0.2, 1: 0.8})
d_tree2_20.fit(X2_train20, y2_train20)
DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8}, random_state=7)
# Checking fit on training data for 80/20 split
y_pred_train2_20 = d_tree2_20.predict(X2_train20)
metrics_score(y2_train20, y_pred_train2_20)
precision recall f1-score support
0 1.00 1.00 1.00 3827
1 1.00 1.00 1.00 941
accuracy 1.00 4768
macro avg 1.00 1.00 1.00 4768
weighted avg 1.00 1.00 1.00 4768
# Checking fit on testing data for 80/20 split
y_pred_test2_20 = d_tree2_20.predict(X2_test20)
metrics_score(y2_test20, y_pred_test2_20)
precision recall f1-score support
0 0.90 0.93 0.91 944
1 0.69 0.60 0.65 248
accuracy 0.86 1192
macro avg 0.80 0.77 0.78 1192
weighted avg 0.86 0.86 0.86 1192
Observations:
# Create minority-oversampled training sets with SMOTE
sm1 = SMOTE(random_state=42)
X1_smo, y1_smo = sm1.fit_resample(X1_train, y1_train)
sm2 = SMOTE(random_state=42)
X2_smo, y2_smo = sm2.fit_resample(X2_train, y2_train)
sm3 = SMOTE(random_state=42)
X3_smo, y3_smo = sm3.fit_resample(X3_train, y3_train)
sm4 = SMOTE(random_state=42)
X4_smo, y4_smo = sm4.fit_resample(X4_train, y4_train)
X1_smo.shape
(6710, 18)
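SMOTE creates synthetic minority rows by interpolating between a minority sample and one of its nearest minority-class neighbors, which is why the resampled training set grows to an exact 50/50 balance. A bare-bones numpy sketch of that interpolation step (illustrative only, not imblearn's implementation):

```python
import numpy as np

rng = np.random.RandomState(42)

# Two minority-class points in feature space
a = np.array([1.0, 2.0])
b = np.array([3.0, 6.0])  # assumed nearest minority neighbor of `a`

# The new synthetic sample lies at a random position on the
# line segment between the point and its neighbor
gap = rng.rand()
synthetic = a + gap * (b - a)
print(synthetic)  # somewhere between [1, 2] and [3, 6]
```

Because synthetic points are convex combinations of real minority points, SMOTE fills in the minority region rather than duplicating rows, but it can also blur class boundaries, which is worth keeping in mind when reading the SMOTE model results below.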
# Visualize results of SMOTE
perc_on_bar(y2_smo, bad_title)
# Look at a couple of examples of numerical variables from SMOTE Analysis for data treatment 1
mortdue_name = "MORTDUE = Mortgage Still Due"
num_uni(X1_smo["MORTDUE"], mortdue_name)
debtinc_name = "DEBTINC = debt-to-income ratio"
num_uni(X1_smo["DEBTINC"], debtinc_name)
derog_name = "DEROG = Number of Derogatory reports"
num_uni(X1_smo["DEROG"], derog_name)
MORTDUE = Mortgage Still Due Skew : 0.65
DEBTINC = debt-to-income ratio Skew : -0.51
DEROG = Number of Derogatory reports Skew : 0
# Look at a couple of examples of numerical variables from SMOTE Analysis for data treatment 2
mortdue_name = "MORTDUE = Mortgage Still Due"
num_uni(X2_smo["MORTDUE"], mortdue_name)
debtinc_name = "DEBTINC = debt-to-income ratio"
num_uni(X2_smo["DEBTINC"], debtinc_name)
derog_name = "DEROG = Number of Derogatory reports"
num_uni(X2_smo["DEROG"], derog_name)
MORTDUE = Mortgage Still Due Skew : 1.85
DEBTINC = debt-to-income ratio Skew : 4.92
DEROG = Number of Derogatory reports Skew : 3.96
Observations:
# Fitting decision tree classifier on data treatment 1 with SMOTE analysis
d_tree1_smo = DecisionTreeClassifier(random_state = 7)
d_tree1_smo.fit(X1_smo, y1_smo)
DecisionTreeClassifier(random_state=7)
# Checking fit on training data treatment 1 with SMOTE
y_pred_train1_smo = d_tree1_smo.predict(X1_smo)
metrics_score(y1_smo, y_pred_train1_smo)
precision recall f1-score support
0 1.00 1.00 1.00 3355
1 1.00 1.00 1.00 3355
accuracy 1.00 6710
macro avg 1.00 1.00 1.00 6710
weighted avg 1.00 1.00 1.00 6710
Observations:
# Checking fit on testing data for treatment 1 with SMOTE
y_pred_test1_smo = d_tree1_smo.predict(X1_test)
metrics_score(y1_test, y_pred_test1_smo)
precision recall f1-score support
0 0.90 0.90 0.90 1416
1 0.62 0.62 0.62 372
accuracy 0.84 1788
macro avg 0.76 0.76 0.76 1788
weighted avg 0.84 0.84 0.84 1788
Observations:
# Fitting decision tree classifier on data treatment 2 with SMOTE analysis
d_tree2_smo = DecisionTreeClassifier(random_state = 7)
d_tree2_smo.fit(X2_smo, y2_smo)
DecisionTreeClassifier(random_state=7)
# Checking fit on training data treatment 2 with SMOTE
y_pred_train2_smo = d_tree2_smo.predict(X2_smo)
metrics_score(y2_smo, y_pred_train2_smo)
precision recall f1-score support
0 1.00 1.00 1.00 3355
1 1.00 1.00 1.00 3355
accuracy 1.00 6710
macro avg 1.00 1.00 1.00 6710
weighted avg 1.00 1.00 1.00 6710
Observations:
# Checking fit on testing data for treatment 2 with SMOTE
y_pred_test2_smo = d_tree2_smo.predict(X2_test)
metrics_score(y2_test, y_pred_test2_smo)
precision recall f1-score support
0 0.91 0.93 0.92 1416
1 0.69 0.64 0.66 372
accuracy 0.87 1788
macro avg 0.80 0.78 0.79 1788
weighted avg 0.86 0.87 0.86 1788
Observations:
# Fitting decision tree classifier on data treatment 3 with SMOTE analysis
d_tree3_smo = DecisionTreeClassifier(random_state = 7)
d_tree3_smo.fit(X3_smo, y3_smo)
DecisionTreeClassifier(random_state=7)
# Checking fit on training data treatment 3 with SMOTE
y_pred_train3_smo = d_tree3_smo.predict(X3_smo)
metrics_score(y3_smo, y_pred_train3_smo)
precision recall f1-score support
0 1.00 1.00 1.00 3355
1 1.00 1.00 1.00 3355
accuracy 1.00 6710
macro avg 1.00 1.00 1.00 6710
weighted avg 1.00 1.00 1.00 6710
# Checking fit on testing data for treatment 3 with SMOTE
y_pred_test3_smo = d_tree3_smo.predict(X3_test)
metrics_score(y3_test, y_pred_test3_smo)
precision recall f1-score support
0 0.91 0.92 0.91 1416
1 0.68 0.65 0.67 372
accuracy 0.86 1788
macro avg 0.80 0.79 0.79 1788
weighted avg 0.86 0.86 0.86 1788
Observations:
# Fitting decision tree classifier on data treatment 4 with SMOTE analysis
d_tree4_smo = DecisionTreeClassifier(random_state = 7)
d_tree4_smo.fit(X4_smo, y4_smo)
DecisionTreeClassifier(random_state=7)
# Checking fit on training data treatment 4 with SMOTE
y_pred_train4_smo = d_tree4_smo.predict(X4_smo)
metrics_score(y4_smo, y_pred_train4_smo)
precision recall f1-score support
0 1.00 1.00 1.00 3355
1 1.00 1.00 1.00 3355
accuracy 1.00 6710
macro avg 1.00 1.00 1.00 6710
weighted avg 1.00 1.00 1.00 6710
# Checking fit on testing data for treatment 4 with SMOTE
y_pred_test4_smo = d_tree4_smo.predict(X4_test)
metrics_score(y4_test, y_pred_test4_smo)
precision recall f1-score support
0 0.91 0.91 0.91 1416
1 0.66 0.66 0.66 372
accuracy 0.86 1788
macro avg 0.79 0.78 0.79 1788
weighted avg 0.86 0.86 0.86 1788
# Choose the type of classifier
d_tree_tuned = DecisionTreeClassifier(random_state = 7)
# Grid of parameters to choose from
parameters = {'max_depth': np.arange(2, 10),
'criterion': ['gini', 'entropy'],
'min_samples_leaf': [5, 10, 20, 25, 30]
}
# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label = 1)
# Run the grid search
grid_obj = GridSearchCV(d_tree_tuned, parameters, scoring = scorer, cv = 5)
grid_obj = grid_obj.fit(X2_smo, y2_smo)
# Set the classifier to the best combination of parameters
d_tree_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data
d_tree_tuned.fit(X2_smo, y2_smo)
DecisionTreeClassifier(max_depth=8, min_samples_leaf=5, random_state=7)
Observations:
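The grid-search recipe above generalizes to any estimator. A self-contained sketch on synthetic data (the data and parameter grid are chosen arbitrarily for illustration) shows the same pattern and how to read off the winning combination:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy data standing in for the SMOTE-treated training set
X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=7)

# Optimize recall for the defaulter class, as in the notebook
scorer = make_scorer(recall_score, pos_label=1)
grid = GridSearchCV(DecisionTreeClassifier(random_state=7),
                    {"max_depth": [2, 4, 6], "min_samples_leaf": [5, 20]},
                    scoring=scorer, cv=5)
grid.fit(X, y)

print(grid.best_params_)           # winning parameter combination
print(round(grid.best_score_, 3))  # mean cross-validated recall for it
```

`best_estimator_` (used in the cell above) is simply the estimator refit on the full training data with `best_params_`; `cv_results_` holds the scores for every combination if a closer look is wanted.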
# Checking fit on training data for the tuned tree
y_pred_train_tuned = d_tree_tuned.predict(X2_smo)
metrics_score(y2_smo, y_pred_train_tuned)
precision recall f1-score support
0 0.88 0.94 0.91 3355
1 0.93 0.87 0.90 3355
accuracy 0.90 6710
macro avg 0.91 0.90 0.90 6710
weighted avg 0.91 0.90 0.90 6710
Observations:
# Checking fit on testing data for the tuned tree
y_pred_test_tuned = d_tree_tuned.predict(X2_test)
metrics_score(y2_test, y_pred_test_tuned)
precision recall f1-score support
0 0.90 0.93 0.92 1416
1 0.70 0.61 0.65 372
accuracy 0.87 1788
macro avg 0.80 0.77 0.79 1788
weighted avg 0.86 0.87 0.86 1788
# Choose the type of classifier
d_tree_tuned2 = DecisionTreeClassifier(random_state = 7)
# Grid of parameters to choose from
parameters = {'max_depth': [4, 5, 6],
'criterion': ['gini', 'entropy'],
'min_samples_split': np.arange(2, 10),
'splitter': ['best', 'random'],
'max_features' : [0.8, 0.9, 1.0],
'min_samples_leaf': [15, 20, 25]}
# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label = 1)
# Run the grid search
grid_obj = GridSearchCV(d_tree_tuned2, parameters, scoring = scorer, cv = 5)
grid_obj = grid_obj.fit(X2_smo, y2_smo)
# Set the classifier to the best combination of parameters
d_tree_tuned2 = grid_obj.best_estimator_
# Fit the best algorithm to the data
d_tree_tuned2.fit(X2_smo, y2_smo)
DecisionTreeClassifier(criterion='entropy', max_depth=6, max_features=0.8,
                       min_samples_leaf=25, random_state=7)
Observations:
# Checking fit on training data for the tuned tree
y_pred_train_tuned2 = d_tree_tuned2.predict(X2_smo)
metrics_score(y2_smo, y_pred_train_tuned2)
precision recall f1-score support
0 0.84 0.86 0.85 3355
1 0.85 0.83 0.84 3355
accuracy 0.85 6710
macro avg 0.85 0.85 0.85 6710
weighted avg 0.85 0.85 0.85 6710
# Checking fit on testing data for the tuned tree
y_pred_test_tuned2 = d_tree_tuned2.predict(X2_test)
metrics_score(y2_test, y_pred_test_tuned2)
precision recall f1-score support
0 0.93 0.86 0.89 1416
1 0.58 0.77 0.66 372
accuracy 0.84 1788
macro avg 0.76 0.81 0.78 1788
weighted avg 0.86 0.84 0.85 1788
Observations:
# Visualize the tuned decision tree
features = list(X2.columns)
plt.figure(figsize = (20, 20))
tree.plot_tree(d_tree_tuned2, feature_names = features, filled = True, fontsize = 9, node_ids = True, class_names = True)
plt.show()
Observations:
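When a plotted tree is too dense to read, scikit-learn's `export_text` renders the same split rules as indented text, which can be easier to quote in a report. A minimal sketch on the bundled iris data (not the HMEQ data):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=7)
clf.fit(iris.data, iris.target)

# Each indented line is one split; leaves show the predicted class
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

For the tuned HMEQ tree, the same call with `feature_names = list(X2.columns)` would list the DELINQ/DEBTINC splits visible in the plot above.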
# Importance of features in the tuned tree building
print (pd.DataFrame(d_tree_tuned2.feature_importances_, columns = ["Imp"], index = X2_smo.columns).sort_values(by = 'Imp', ascending = False))
                     Imp
DELINQ          0.542399
DEBTINC         0.255244
CLAGE           0.057736
DEROG           0.044845
VALUE           0.025802
MORTDUE         0.016010
NINQ            0.013676
LOAN            0.012413
JOB_Office      0.012392
CLNO            0.009234
YOJ             0.008296
REASON_DebtCon  0.001953
REASON_HomeImp  0.000000
JOB_Mgr         0.000000
JOB_Other       0.000000
JOB_ProfExe     0.000000
JOB_Sales       0.000000
JOB_Self        0.000000
# Plotting the feature importances for the tuned decision tree
importances = d_tree_tuned2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize = (10, 10))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color = 'violet', align = 'center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Observations:
Data Treatment 1 Random Forest
# Fitting the random forest classifier on the training data for df_treat1
rf_estimator1 = RandomForestClassifier(random_state = 7, criterion = "entropy")
rf_estimator1.fit(X1_train, y1_train)
RandomForestClassifier(criterion='entropy', random_state=7)
# Checking performance on the training data
y_rf_train1 = rf_estimator1.predict(X1_train)
metrics_score(y1_train, y_rf_train1)
precision recall f1-score support
0 1.00 1.00 1.00 3355
1 1.00 1.00 1.00 817
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
Observations:
# Checking performance on the testing data
y_rf_test1 = rf_estimator1.predict(X1_test)
metrics_score(y1_test, y_rf_test1)
precision recall f1-score support
0 0.90 0.98 0.94 1416
1 0.86 0.60 0.71 372
accuracy 0.90 1788
macro avg 0.88 0.79 0.82 1788
weighted avg 0.90 0.90 0.89 1788
Observations:
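The 1.00 training scores above only mean the unpruned forest memorized its training data; they say nothing about generalization. Besides the hold-out test set, random forests offer a built-in check: the out-of-bag (OOB) score, where each tree is evaluated on the rows its bootstrap sample left out. A sketch on synthetic data (illustrative, not the notebook's data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=7)

rf = RandomForestClassifier(criterion="entropy", oob_score=True, random_state=7)
rf.fit(X, y)

print(round(rf.score(X, y), 3))  # training accuracy: near-perfect, uninformative
print(round(rf.oob_score_, 3))   # out-of-bag accuracy: internal hold-out estimate
```

The OOB score will usually sit close to the test-set accuracy reported in the cells above, at no extra data cost.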
Data Treatment 2 Random Forest
# Fitting the random forest classifier on the training data for df_treat2
rf_estimator2 = RandomForestClassifier(random_state = 7, criterion = "entropy")
rf_estimator2.fit(X2_train, y2_train)
RandomForestClassifier(criterion='entropy', random_state=7)
# Checking performance on the training data
y_rf_train2 = rf_estimator2.predict(X2_train)
metrics_score(y2_train, y_rf_train2)
precision recall f1-score support
0 1.00 1.00 1.00 3355
1 1.00 1.00 1.00 817
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
Observations:
# Checking performance on the testing data
y_rf_test2 = rf_estimator2.predict(X2_test)
metrics_score(y2_test, y_rf_test2)
precision recall f1-score support
0 0.92 0.98 0.95 1416
1 0.91 0.66 0.77 372
accuracy 0.92 1788
macro avg 0.91 0.82 0.86 1788
weighted avg 0.92 0.92 0.91 1788
Observations:
Data Treatment 3 Random Forest
# Fitting the random forest classifier on the training data for df_treat3
rf_estimator3 = RandomForestClassifier(random_state = 7, criterion = "entropy")
rf_estimator3.fit(X3_train, y3_train)
RandomForestClassifier(criterion='entropy', random_state=7)
# Checking performance on the training data
y_rf_train3 = rf_estimator3.predict(X3_train)
metrics_score(y3_train, y_rf_train3)
precision recall f1-score support
0 1.00 1.00 1.00 3355
1 1.00 1.00 1.00 817
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
Observations:
# Checking performance on the testing data
y_rf_test3 = rf_estimator3.predict(X3_test)
metrics_score(y3_test, y_rf_test3)
precision recall f1-score support
0 0.92 0.98 0.95 1416
1 0.90 0.67 0.77 372
accuracy 0.92 1788
macro avg 0.91 0.83 0.86 1788
weighted avg 0.92 0.92 0.91 1788
Observations:
Data Treatment 4 Random Forest
# Fitting the random forest classifier on the training data for df_treat4
rf_estimator4 = RandomForestClassifier(random_state = 7, criterion = "entropy")
rf_estimator4.fit(X4_train, y4_train)
RandomForestClassifier(criterion='entropy', random_state=7)
# Checking performance on the training data
y_rf_train4 = rf_estimator4.predict(X4_train)
metrics_score(y4_train, y_rf_train4)
precision recall f1-score support
0 1.00 1.00 1.00 3355
1 1.00 1.00 1.00 817
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
Observations:
# Checking performance on the testing data
y_rf_test4 = rf_estimator4.predict(X4_test)
metrics_score(y4_test, y_rf_test4)
precision recall f1-score support
0 0.92 0.98 0.95 1416
1 0.89 0.67 0.77 372
accuracy 0.91 1788
macro avg 0.91 0.83 0.86 1788
weighted avg 0.91 0.91 0.91 1788
Observations:
Data Treatment 5 Random Forest
# Fitting the random forest classifier on the training data for df_treat5
rf_estimator5 = RandomForestClassifier(random_state = 7, criterion = "entropy")
rf_estimator5.fit(X5_train, y5_train)
RandomForestClassifier(criterion='entropy', random_state=7)
# Checking performance on the training data
y_rf_train5 = rf_estimator5.predict(X5_train)
metrics_score(y5_train, y_rf_train5)
precision recall f1-score support
0 1.00 1.00 1.00 2151
1 1.00 1.00 1.00 203
accuracy 1.00 2354
macro avg 1.00 1.00 1.00 2354
weighted avg 1.00 1.00 1.00 2354
Observations:
# Checking performance on the testing data
y_rf_test5 = rf_estimator5.predict(X5_test)
metrics_score(y5_test, y_rf_test5)
precision recall f1-score support
0 0.93 1.00 0.96 913
1 1.00 0.28 0.44 97
accuracy 0.93 1010
macro avg 0.96 0.64 0.70 1010
weighted avg 0.94 0.93 0.91 1010
Observations:
# Fitting random forest classifier on data treatment 2 with class weights
rf_estimator2_cw = RandomForestClassifier(random_state = 7, criterion = "entropy", class_weight = {0: 0.2, 1: 0.8})
rf_estimator2_cw.fit(X2_train, y2_train)
RandomForestClassifier(class_weight={0: 0.2, 1: 0.8}, criterion='entropy',
                       random_state=7)
# Checking performance on the training data
y_rf_train2_cw = rf_estimator2_cw.predict(X2_train)
metrics_score(y2_train, y_rf_train2_cw)
precision recall f1-score support
0 1.00 1.00 1.00 3355
1 1.00 1.00 1.00 817
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
# Checking performance on the testing data
y_rf_test2_cw = rf_estimator2_cw.predict(X2_test)
metrics_score(y2_test, y_rf_test2_cw)
precision recall f1-score support
0 0.91 0.98 0.95 1416
1 0.90 0.64 0.75 372
accuracy 0.91 1788
macro avg 0.91 0.81 0.85 1788
weighted avg 0.91 0.91 0.90 1788
Observations:
# Fitting random forest classifier on data treatment 2 75/25 train/test split
rf_estimator2_25 = RandomForestClassifier(random_state = 7, criterion = "entropy")
rf_estimator2_25.fit(X2_train25, y2_train25)
RandomForestClassifier(criterion='entropy', random_state=7)
# Checking performance on the training data
y_rf_train2_25 = rf_estimator2_25.predict(X2_train25)
metrics_score(y2_train25, y_rf_train2_25)
precision recall f1-score support
0 1.00 1.00 1.00 3600
1 1.00 1.00 1.00 870
accuracy 1.00 4470
macro avg 1.00 1.00 1.00 4470
weighted avg 1.00 1.00 1.00 4470
# Checking performance on the testing data
y_rf_test2_25 = rf_estimator2_25.predict(X2_test25)
metrics_score(y2_test25, y_rf_test2_25)
precision recall f1-score support
0 0.92 0.98 0.95 1171
1 0.91 0.67 0.77 319
accuracy 0.92 1490
macro avg 0.91 0.83 0.86 1490
weighted avg 0.92 0.92 0.91 1490
Observations:
# Fitting random forest classifier on data treatment 2 80/20 train/test split
rf_estimator2_20 = RandomForestClassifier(random_state = 7, criterion = "entropy")
rf_estimator2_20.fit(X2_train20, y2_train20)
RandomForestClassifier(criterion='entropy', random_state=7)
# Checking performance on the training data
y_rf_train2_20 = rf_estimator2_20.predict(X2_train20)
metrics_score(y2_train20, y_rf_train2_20)
precision recall f1-score support
0 1.00 1.00 1.00 3827
1 1.00 1.00 1.00 941
accuracy 1.00 4768
macro avg 1.00 1.00 1.00 4768
weighted avg 1.00 1.00 1.00 4768
# Checking performance on the testing data
y_rf_test2_20 = rf_estimator2_20.predict(X2_test20)
metrics_score(y2_test20, y_rf_test2_20)
precision recall f1-score support
0 0.92 0.98 0.95 944
1 0.91 0.68 0.78 248
accuracy 0.92 1192
macro avg 0.92 0.83 0.86 1192
weighted avg 0.92 0.92 0.91 1192
Observations:
# Fitting the random forest classifier on the training data for df_treat1 with SMOTE
rf_estimator1_smo = RandomForestClassifier(random_state = 7, criterion = "entropy")
rf_estimator1_smo.fit(X1_smo, y1_smo)
RandomForestClassifier(criterion='entropy', random_state=7)
# Checking performance on the training data
y_rf_train1_smo = rf_estimator1_smo.predict(X1_smo)
metrics_score(y1_smo, y_rf_train1_smo)
precision recall f1-score support
0 1.00 1.00 1.00 3355
1 1.00 1.00 1.00 3355
accuracy 1.00 6710
macro avg 1.00 1.00 1.00 6710
weighted avg 1.00 1.00 1.00 6710
# Checking performance on the testing data
y_rf_test1_smo = rf_estimator1_smo.predict(X1_test)
metrics_score(y1_test, y_rf_test1_smo)
precision recall f1-score support
0 0.93 0.96 0.95 1416
1 0.84 0.72 0.77 372
accuracy 0.91 1788
macro avg 0.88 0.84 0.86 1788
weighted avg 0.91 0.91 0.91 1788
Observations:
# Fitting the random forest classifier on the training data for df_treat2 with SMOTE
rf_estimator2_smo = RandomForestClassifier(random_state = 7, criterion = "entropy")
rf_estimator2_smo.fit(X2_smo, y2_smo)
RandomForestClassifier(criterion='entropy', random_state=7)
# Checking performance on the training data
y_rf_train2_smo = rf_estimator2_smo.predict(X2_smo)
metrics_score(y2_smo, y_rf_train2_smo)
precision recall f1-score support
0 1.00 1.00 1.00 3355
1 1.00 1.00 1.00 3355
accuracy 1.00 6710
macro avg 1.00 1.00 1.00 6710
weighted avg 1.00 1.00 1.00 6710
# Checking performance on the testing data
y_rf_test2_smo = rf_estimator2_smo.predict(X2_test)
metrics_score(y2_test, y_rf_test2_smo)
precision recall f1-score support
0 0.94 0.98 0.96 1416
1 0.92 0.77 0.84 372
accuracy 0.94 1788
macro avg 0.93 0.88 0.90 1788
weighted avg 0.94 0.94 0.94 1788
Observations:
# Fitting the random forest classifier on the training data for df_treat3 with SMOTE
rf_estimator3_smo = RandomForestClassifier(random_state = 7, criterion = "entropy")
rf_estimator3_smo.fit(X3_smo, y3_smo)
RandomForestClassifier(criterion='entropy', random_state=7)
# Checking performance on the training data
y_rf_train3_smo = rf_estimator3_smo.predict(X3_smo)
metrics_score(y3_smo, y_rf_train3_smo)
precision recall f1-score support
0 1.00 1.00 1.00 3355
1 1.00 1.00 1.00 3355
accuracy 1.00 6710
macro avg 1.00 1.00 1.00 6710
weighted avg 1.00 1.00 1.00 6710
# Checking performance on the testing data
y_rf_test3_smo = rf_estimator3_smo.predict(X3_test)
metrics_score(y3_test, y_rf_test3_smo)
precision recall f1-score support
0 0.94 0.98 0.96 1416
1 0.91 0.77 0.83 372
accuracy 0.94 1788
macro avg 0.92 0.88 0.90 1788
weighted avg 0.94 0.94 0.93 1788
Observations:
# Choose the type of classifier
rf_estimator_tuned = RandomForestClassifier(criterion = "entropy", random_state = 7)
# Grid of parameters to choose from
parameters = {"n_estimators": [100, 110, 120],
"max_depth": [4, 5, 6, 7],
"max_features": [0.8, 0.9, 1]
}
# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label = 1)
# Run the grid search
grid_obj = GridSearchCV(rf_estimator_tuned, parameters, scoring = scorer, cv = 5)
grid_obj = grid_obj.fit(X2_smo, y2_smo)
# Set the classifier to the best combination of parameters
rf_estimator_tuned = grid_obj.best_estimator_
# Fitting the best algorithm to the training data
rf_estimator_tuned.fit(X2_smo, y2_smo)
RandomForestClassifier(criterion='entropy', max_depth=7, max_features=1,
                       n_estimators=120, random_state=7)
# Checking performance on the training data
y_tune_train = rf_estimator_tuned.predict(X2_smo)
metrics_score(y2_smo, y_tune_train)
precision recall f1-score support
0 0.89 0.94 0.91 3355
1 0.94 0.88 0.91 3355
accuracy 0.91 6710
macro avg 0.91 0.91 0.91 6710
weighted avg 0.91 0.91 0.91 6710
# Checking performance on the testing data
y_tune_test = rf_estimator_tuned.predict(X2_test)
metrics_score(y2_test, y_tune_test)
precision recall f1-score support
0 0.90 0.94 0.92 1416
1 0.71 0.60 0.65 372
accuracy 0.87 1788
macro avg 0.81 0.77 0.79 1788
weighted avg 0.86 0.87 0.86 1788
Observations:
# Choose the type of classifier
rf_estimator_tuned2 = RandomForestClassifier(random_state = 7)
# Grid of parameters to choose from
parameters = {"n_estimators": [110, 120],
"criterion" : ["entropy", "gini"],
"max_depth": [6, 7],
"max_features": [0.8, 0.9],
"min_samples_leaf": [20, 25],
"max_samples": [0.9, 1.0]}
# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label = 1)
# Run the grid search
grid_obj = GridSearchCV(rf_estimator_tuned2, parameters, scoring = scorer, cv = 5)
grid_obj = grid_obj.fit(X2_smo, y2_smo)
# Set the classifier to the best combination of parameters
rf_estimator_tuned2 = grid_obj.best_estimator_
# Fitting the best algorithm to the training data
rf_estimator_tuned2.fit(X2_smo, y2_smo)
RandomForestClassifier(criterion='entropy', max_depth=7, max_features=0.8,
                       max_samples=0.9, min_samples_leaf=25, n_estimators=110,
                       random_state=7)
# Checking performance on the training data
y_tune_train2 = rf_estimator_tuned2.predict(X2_smo)
metrics_score(y2_smo, y_tune_train2)
precision recall f1-score support
0 0.86 0.91 0.89 3355
1 0.91 0.85 0.88 3355
accuracy 0.88 6710
macro avg 0.88 0.88 0.88 6710
weighted avg 0.88 0.88 0.88 6710
# Checking performance on the testing data
y_tune_test2 = rf_estimator_tuned2.predict(X2_test)
metrics_score(y2_test, y_tune_test2)
precision recall f1-score support
0 0.92 0.92 0.92 1416
1 0.69 0.71 0.70 372
accuracy 0.87 1788
macro avg 0.80 0.81 0.81 1788
weighted avg 0.87 0.87 0.87 1788
Observations:
# Importance of features in the forest building
print (pd.DataFrame(rf_estimator2_smo.feature_importances_, columns = ["Imp"], index = X2_smo.columns).sort_values(by = 'Imp', ascending = False))
                     Imp
DELINQ          0.173612
DEBTINC         0.134882
DEROG           0.089480
CLAGE           0.080422
NINQ            0.079458
VALUE           0.073072
MORTDUE         0.069211
LOAN            0.065517
CLNO            0.062063
YOJ             0.059764
JOB_Office      0.021742
REASON_DebtCon  0.019486
REASON_HomeImp  0.018214
JOB_Other       0.017133
JOB_ProfExe     0.016758
JOB_Mgr         0.012720
JOB_Sales       0.003710
JOB_Self        0.002757
# Plot feature importances for the SMOTE-trained random forest (data treatment 2)
importances = rf_estimator2_smo.feature_importances_
indices = np.argsort(importances)
feature_names = list(X2.columns)
plt.figure(figsize = (12, 12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color = 'violet', align = 'center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Observations:
# Scaling the data
sc = StandardScaler()
X1_sc = sc.fit_transform(X1)
X1_sc = pd.DataFrame(X1_sc, columns = X1.columns)
X2_sc = sc.fit_transform(X2)
X2_sc = pd.DataFrame(X2_sc, columns = X2.columns)
X3_sc = sc.fit_transform(X3)
X3_sc = pd.DataFrame(X3_sc, columns = X3.columns)
X4_sc = sc.fit_transform(X4)
X4_sc = pd.DataFrame(X4_sc, columns = X4.columns)
X5_sc = sc.fit_transform(X5)
X5_sc = pd.DataFrame(X5_sc, columns = X5.columns)
# Splitting data in to training and testing sets, 70/30
X1_train_sc, X1_test_sc, y1_train_sc, y1_test_sc = train_test_split(X1_sc, Y1, test_size = 0.30, random_state = 1)
X2_train_sc, X2_test_sc, y2_train_sc, y2_test_sc = train_test_split(X2_sc, Y2, test_size = 0.30, random_state = 1)
X3_train_sc, X3_test_sc, y3_train_sc, y3_test_sc = train_test_split(X3_sc, Y3, test_size = 0.30, random_state = 1)
X4_train_sc, X4_test_sc, y4_train_sc, y4_test_sc = train_test_split(X4_sc, Y4, test_size = 0.30, random_state = 1)
X5_train_sc, X5_test_sc, y5_train_sc, y5_test_sc = train_test_split(X5_sc, Y5, test_size = 0.30, random_state = 1)
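One caution about the scaling cells above: the scaler is fit on the full dataset before splitting, so test-set statistics leak into the training features. Wrapping the scaler and model in a pipeline keeps the scaling fit inside the training fold. A sketch on synthetic data (illustrative names and data, not from the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)

# Scaler statistics now come from the training fold only
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

acc = pipe.score(X_te, y_te)
print(round(acc, 3))
```

The leakage from pre-split scaling is usually small, but a pipeline also keeps cross-validation honest when tuning the downstream model.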
# Creating logistic regression model for data treatment 1
lrm1 = LogisticRegression(penalty = 'none')
lrm1.fit(X1_train_sc, y1_train_sc)
lr1_error_rate = (1 - lrm1.score(X1_train_sc, y1_train_sc))*100
lrm_r1 = LogisticRegression(penalty = 'l1', solver = 'liblinear')
lrm_r1.fit(X1_train_sc, y1_train_sc)
lrm_r1_error_rate = (1 - lrm_r1.score(X1_train_sc, y1_train_sc))*100
print("Logistic Regression error rate using all the features is {}% ".format(np.round(lr1_error_rate, 2)))
print("Logistic Regression error rate using absolute value (Lasso) regularization is {}% ".format(np.round(lrm_r1_error_rate, 2)))
Logistic Regression error rate using all the features is 18.79%
Logistic Regression error rate using absolute value (Lasso) regularization is 18.82%
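With L1 (lasso) regularization, some coefficients are driven exactly to zero, which supports the interpretability requirement: only the surviving features need justifying. A sketch of reading off which features survive (synthetic data, arbitrary `C`; names are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Ten features, only three of them informative
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=1)

lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso.fit(X, y)

kept = np.flatnonzero(lasso.coef_[0])  # indices of features with nonzero weight
print(len(kept), "of", X.shape[1], "features kept")
```

On the HMEQ models above, the same inspection of `lrm_r1.coef_` against `X1.columns` would show which inputs the lasso fit actually uses.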
y_lrm_train1 = lrm_r1.predict(X1_train_sc)
metrics_score(y1_train_sc, y_lrm_train1)
precision recall f1-score support
0 0.82 0.98 0.89 3355
1 0.62 0.10 0.17 817
accuracy 0.81 4172
macro avg 0.72 0.54 0.53 4172
weighted avg 0.78 0.81 0.75 4172
Observations:
y_lrm_test1 = lrm_r1.predict(X1_test_sc)
metrics_score(y1_test_sc, y_lrm_test1)
precision recall f1-score support
0 0.80 0.98 0.88 1416
1 0.53 0.08 0.14 372
accuracy 0.79 1788
macro avg 0.66 0.53 0.51 1788
weighted avg 0.74 0.79 0.73 1788
Observations:
# Creating logistic regression model for data treatment 2
lrm2 = LogisticRegression(penalty = 'none')
lrm2.fit(X2_train_sc, y2_train_sc)
lr2_error_rate = (1 - lrm2.score(X2_train_sc, y2_train_sc))*100
lrm_r2 = LogisticRegression(penalty = 'l1', solver = 'liblinear')
lrm_r2.fit(X2_train_sc, y2_train_sc)
lrm_r2_error_rate = (1 - lrm_r2.score(X2_train_sc, y2_train_sc))*100
print("Logistic Regression error rate using all the features is {}% ".format(np.round(lr2_error_rate, 2)))
print("Logistic Regression error rate using absolute value (Lasso) regularization is {}% ".format(np.round(lrm_r2_error_rate, 2)))
Logistic Regression error rate using all the features is 15.36%
Logistic Regression error rate using absolute value (Lasso) regularization is 15.39%
y_lrm_train2 = lrm_r2.predict(X2_train_sc)
metrics_score(y2_train_sc, y_lrm_train2)
precision recall f1-score support
0 0.86 0.97 0.91 3355
1 0.74 0.33 0.46 817
accuracy 0.85 4172
macro avg 0.80 0.65 0.68 4172
weighted avg 0.83 0.85 0.82 4172
Observations:
y_lrm_test2 = lrm_r2.predict(X2_test_sc)
metrics_score(y2_test_sc, y_lrm_test2)
precision recall f1-score support
0 0.84 0.97 0.90 1416
1 0.70 0.28 0.40 372
accuracy 0.83 1788
macro avg 0.77 0.63 0.65 1788
weighted avg 0.81 0.83 0.80 1788
Observations:
# Creating logistic regression model for data treatment 3
lrm3 = LogisticRegression(penalty = 'none')
lrm3.fit(X3_train_sc, y3_train_sc)
lr3_error_rate = (1 - lrm3.score(X3_train_sc, y3_train_sc))*100
lrm_r3 = LogisticRegression(penalty = 'l1', solver = 'liblinear')
lrm_r3.fit(X3_train_sc, y3_train_sc)
lrm_r3_error_rate = (1 - lrm_r3.score(X3_train_sc, y3_train_sc))*100
print("Logistic Regression error rate using all the features is {}% ".format(np.round(lr3_error_rate, 2)))
print("Logistic Regression error rate using absolute value (Lasso) regularization is {}% ".format(np.round(lrm_r3_error_rate, 2)))
Logistic Regression error rate using all the features is 15.51%
Logistic Regression error rate using absolute value (Lasso) regularization is 15.56%
y_lrm_train3 = lrm_r3.predict(X3_train_sc)
metrics_score(y3_train_sc, y_lrm_train3)
precision recall f1-score support
0 0.85 0.97 0.91 3355
1 0.73 0.32 0.45 817
accuracy 0.84 4172
macro avg 0.79 0.65 0.68 4172
weighted avg 0.83 0.84 0.82 4172
Observations:
y_lrm_test3 = lrm_r3.predict(X3_test_sc)
metrics_score(y3_test_sc, y_lrm_test3)
precision recall f1-score support
0 0.84 0.97 0.90 1416
1 0.72 0.28 0.41 372
accuracy 0.83 1788
macro avg 0.78 0.63 0.65 1788
weighted avg 0.81 0.83 0.80 1788
Observations:
# Creating logistic regression model for data treatment 4
lrm4 = LogisticRegression(penalty = 'none')
lrm4.fit(X4_train_sc, y4_train_sc)
lr4_error_rate = (1 - lrm4.score(X4_train_sc, y4_train_sc))*100
lrm_r4 = LogisticRegression(penalty = 'l1', solver = 'liblinear')
lrm_r4.fit(X4_train_sc, y4_train_sc)
lrm_r4_error_rate = (1 - lrm_r4.score(X4_train_sc, y4_train_sc))*100
print("Logistic Regression error rate using all the features is {}% ".format(np.round(lr4_error_rate, 2)))
print("Logistic Regression error rate using absolute value (Lasso) regularization is {}% ".format(np.round(lrm_r4_error_rate, 2)))
Logistic Regression error rate using all the features is 17.43%
Logistic Regression error rate using absolute value (Lasso) regularization is 17.43%
y_lrm_train4 = lrm_r4.predict(X4_train_sc)
metrics_score(y4_train_sc, y_lrm_train4)
precision recall f1-score support
0 0.85 0.95 0.90 3355
1 0.61 0.30 0.40 817
accuracy 0.83 4172
macro avg 0.73 0.63 0.65 4172
weighted avg 0.80 0.83 0.80 4172
Observations:
y_lrm_test4 = lrm_r4.predict(X4_test_sc)
metrics_score(y4_test_sc, y_lrm_test4)
precision recall f1-score support
0 0.83 0.96 0.89 1416
1 0.63 0.26 0.37 372
accuracy 0.81 1788
macro avg 0.73 0.61 0.63 1788
weighted avg 0.79 0.81 0.78 1788
Observations:
# Creating logistic regression model for data treatment 5
lrm5 = LogisticRegression(penalty = 'none')
lrm5.fit(X5_train_sc, y5_train_sc)
lr5_error_rate = (1 - lrm5.score(X5_train_sc, y5_train_sc))*100
lrm_r5 = LogisticRegression(penalty = 'l1', solver = 'liblinear')
lrm_r5.fit(X5_train_sc, y5_train_sc)
lrm_r5_error_rate = (1 - lrm_r5.score(X5_train_sc, y5_train_sc))*100
print("Logistic Regression error rate using all the features is {}% ".format(np.round(lr5_error_rate, 2)))
print("Logistic Regression error rate using absolute value (Lasso) regularization is {}% ".format(np.round(lrm_r5_error_rate, 2)))
Logistic Regression error rate using all the features is 7.94%
Logistic Regression error rate using absolute value (Lasso) regularization is 8.03%
y_lrm_train5 = lrm_r5.predict(X5_train_sc)
metrics_score(y5_train_sc, y_lrm_train5)
precision recall f1-score support
0 0.92 1.00 0.96 2151
1 1.00 0.07 0.13 203
accuracy 0.92 2354
macro avg 0.96 0.53 0.54 2354
weighted avg 0.93 0.92 0.89 2354
Observations: Despite 92% training accuracy, the model recalls only 7% of defaulters under data treatment 5; the accuracy is driven almost entirely by the majority class.
y_lrm_test5 = lrm_r5.predict(X5_test_sc)
metrics_score(y5_test_sc, y_lrm_test5)
precision recall f1-score support
0 0.90 1.00 0.95 913
1 0.00 0.00 0.00 97
accuracy 0.90 1010
macro avg 0.45 0.50 0.47 1010
weighted avg 0.82 0.90 0.86 1010
Observations: On the test data the model fails to identify a single defaulter (recall 0.00), making it unusable for the bank's purpose.
Creating SMOTE datasets with scaled values
# SMOTE datasets for logistic regression, scaled
sm2_sc = SMOTE(random_state=42)
X2_smosc, y2_smosc = sm2_sc.fit_resample(X2_train_sc, y2_train_sc)
sm3_sc = SMOTE(random_state=42)
X3_smosc, y3_smosc = sm3_sc.fit_resample(X3_train_sc, y3_train_sc)
# Creating logistic regression model for data treatment 2 with SMOTE analysis
lr2_smo = LogisticRegression(penalty = 'none')
lr2_smo.fit(X2_smosc, y2_smosc)
lr2_smo_error_rate = (1 - lr2_smo.score(X2_smosc, y2_smosc))*100
lrl2_smo = LogisticRegression(penalty = 'l1', solver = 'liblinear')
lrl2_smo.fit(X2_smosc, y2_smosc)
lrl2_smo_error_rate = (1 - lrl2_smo.score(X2_smosc, y2_smosc))*100
print("Logistic Regression error rate using all the features is {}% ".format(np.round(lr2_smo_error_rate, 2)))
print("Logistic Regression error rate using absolute value (Lasso) regularization is {}% ".format(np.round(lrl2_smo_error_rate, 2)))
Logistic Regression error rate using all the features is 28.27%
Logistic Regression error rate using absolute value (Lasso) regularization is 28.26%
y_lrl_smo_train2 = lrl2_smo.predict(X2_smosc)
metrics_score(y2_smosc, y_lrl_smo_train2)
precision recall f1-score support
0 0.70 0.76 0.73 3355
1 0.74 0.67 0.70 3355
accuracy 0.72 6710
macro avg 0.72 0.72 0.72 6710
weighted avg 0.72 0.72 0.72 6710
y_lrl_smo_test2 = lrl2_smo.predict(X2_test_sc)
metrics_score(y2_test_sc, y_lrl_smo_test2)
precision recall f1-score support
0 0.90 0.76 0.83 1416
1 0.43 0.67 0.52 372
accuracy 0.75 1788
macro avg 0.66 0.72 0.67 1788
weighted avg 0.80 0.75 0.76 1788
Observations: SMOTE oversampling raises test recall for defaulters to 67%, at the cost of precision falling to 43%; more risky applicants are flagged, but with more false alarms.
# Creating logistic regression model for data treatment 3 with SMOTE analysis
lr3_smo = LogisticRegression(penalty = 'none')
lr3_smo.fit(X3_smosc, y3_smosc)
lr3_smo_error_rate = (1 - lr3_smo.score(X3_smosc, y3_smosc))*100
lrl3_smo = LogisticRegression(penalty = 'l1', solver = 'liblinear')
lrl3_smo.fit(X3_smosc, y3_smosc)
lrl3_smo_error_rate = (1 - lrl3_smo.score(X3_smosc, y3_smosc))*100
print("Logistic Regression error rate using all the features is {}% ".format(np.round(lr3_smo_error_rate, 2)))
print("Logistic Regression error rate using absolute value (Lasso) regularization is {}% ".format(np.round(lrl3_smo_error_rate, 2)))
Logistic Regression error rate using all the features is 28.26%
Logistic Regression error rate using absolute value (Lasso) regularization is 28.3%
y_lrl_smo_train3 = lrl3_smo.predict(X3_smosc)
metrics_score(y3_smosc, y_lrl_smo_train3)
precision recall f1-score support
0 0.70 0.76 0.73 3355
1 0.74 0.68 0.71 3355
accuracy 0.72 6710
macro avg 0.72 0.72 0.72 6710
weighted avg 0.72 0.72 0.72 6710
y_lrl_smo_test3 = lrl3_smo.predict(X3_test_sc)
metrics_score(y3_test_sc, y_lrl_smo_test3)
precision recall f1-score support
0 0.90 0.76 0.82 1416
1 0.43 0.67 0.52 372
accuracy 0.74 1788
macro avg 0.66 0.72 0.67 1788
weighted avg 0.80 0.74 0.76 1788
Observations: Data treatment 3 with SMOTE performs almost identically to treatment 2, again reaching 67% test recall and 43% precision for defaulters.
# Printing the coefficients of logistic regression for data treatment 2 with SMOTE analysis
cols = X2_sc.columns
coef_lg = lrl2_smo.coef_
pd.DataFrame(coef_lg,columns = cols).T.sort_values(by = 0, ascending = False)
| Feature | Coefficient |
|---|---|
| DELINQ | 0.822106 |
| DEBTINC | 0.619645 |
| DEROG | 0.491490 |
| NINQ | 0.277293 |
| VALUE | 0.152664 |
| JOB_Sales | 0.107191 |
| JOB_Self | 0.049928 |
| JOB_Mgr | 0.014291 |
| REASON_HomeImp | 0.000281 |
| JOB_Other | 0.000000 |
| JOB_ProfExe | -0.044537 |
| REASON_DebtCon | -0.068317 |
| MORTDUE | -0.131547 |
| YOJ | -0.148823 |
| CLNO | -0.197611 |
| LOAN | -0.220999 |
| JOB_Office | -0.255672 |
| CLAGE | -0.481833 |
Observations: DELINQ, DEBTINC, and DEROG carry the largest positive coefficients, increasing the predicted chance of default, while CLAGE has the strongest negative coefficient.
# Finding the odds for data treatment 2
odds = np.exp(lrl2_smo.coef_[0])
# Adding the odds to a DataFrame and sorting the values
pd.DataFrame(odds, X2_train.columns, columns = ['odds']).sort_values(by = 'odds', ascending = False)
| Feature | Odds |
|---|---|
| DELINQ | 2.275286 |
| DEBTINC | 1.858268 |
| DEROG | 1.634750 |
| NINQ | 1.319553 |
| VALUE | 1.164933 |
| JOB_Sales | 1.113147 |
| JOB_Self | 1.051195 |
| JOB_Mgr | 1.014394 |
| REASON_HomeImp | 1.000281 |
| JOB_Other | 1.000000 |
| JOB_ProfExe | 0.956441 |
| REASON_DebtCon | 0.933964 |
| MORTDUE | 0.876738 |
| YOJ | 0.861722 |
| CLNO | 0.820689 |
| LOAN | 0.801717 |
| JOB_Office | 0.774396 |
| CLAGE | 0.617650 |
# Plotting Precision-Recall Curve for data treatment 2
y_scores_lg = lrl2_smo.predict_proba(X2_smosc)
precisions_lg, recalls_lg, thresholds_lg = precision_recall_curve(y2_smosc, y_scores_lg[:, 1])
plt.figure(figsize = (10, 7))
plt.plot(thresholds_lg, precisions_lg[:-1], 'b--', label = 'precision')
plt.plot(thresholds_lg, recalls_lg[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc = 'upper left')
plt.ylim([0, 1])
plt.axvline(0.46)
plt.show()
Observations: The precision and recall curves intersect near a threshold of 0.46, marked by the vertical line; this threshold balances the two metrics.
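The threshold can also be located programmatically rather than read off the plot, by picking the probability where precision and recall are closest. A minimal sketch, with a synthetic balanced dataset standing in for the SMOTE training set (the 0.46 value used below came from the notebook's own curve):

```python
# Sketch: choose the classification threshold where precision and recall
# cross on the precision-recall curve. make_classification stands in for
# the SMOTE-balanced training data used above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

X, y = make_classification(n_samples=2000, random_state=42)
scores = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

prec, rec, thr = precision_recall_curve(y, scores)
# precision_recall_curve returns one more precision/recall value than thresholds
crossing = thr[np.argmin(np.abs(prec[:-1] - rec[:-1]))]
print(round(float(crossing), 2))
```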
optimal_threshold1 = .46
y_pred_train2_lr = lrl2_smo.predict_proba(X2_smosc)
metrics_score(y2_smosc, y_pred_train2_lr[:, 1:] > optimal_threshold1)
precision recall f1-score support
0 0.72 0.72 0.72 3355
1 0.72 0.72 0.72 3355
accuracy 0.72 6710
macro avg 0.72 0.72 0.72 6710
weighted avg 0.72 0.72 0.72 6710
optimal_threshold1 = .46
y_pred_test2_lr = lrl2_smo.predict_proba(X2_test_sc)
metrics_score(y2_test_sc, y_pred_test2_lr[:, 1:] > optimal_threshold1)
precision recall f1-score support
0 0.91 0.73 0.81 1416
1 0.41 0.73 0.53 372
accuracy 0.73 1788
macro avg 0.66 0.73 0.67 1788
weighted avg 0.81 0.73 0.75 1788
Observations: At the 0.46 threshold, test recall for defaulters improves to 73% while precision drops slightly to 41%, a reasonable trade-off when missed defaults are costly to the bank.
knn2_smo = KNeighborsClassifier()
# We select the optimal value of K for which the error rate is the least in the validation data
# Let us loop over a few values of K to determine the optimal value of K
train_error = []
test_error = []
knn_many_split2 = {}
error_df_knn = pd.DataFrame()
features = X2.columns
for k in range(1, 15):
    train_error = []
    test_error = []
    lista = []
    knn2_smo = KNeighborsClassifier(n_neighbors = k)
    for i in range(30):
        x_train_new2, x_val2, y_train_new2, y_val2 = train_test_split(X2_smosc, y2_smosc, test_size = 0.30)
        # Fitting K-NN on the training data
        knn2_smo.fit(x_train_new2, y_train_new2)
        # Calculating error on the training data and the validation data
        train_error.append(1 - knn2_smo.score(x_train_new2, y_train_new2))
        test_error.append(1 - knn2_smo.score(x_val2, y_val2))
    lista.append(sum(train_error)/len(train_error))
    lista.append(sum(test_error)/len(test_error))
    knn_many_split2[k] = lista
knn_many_split2
{1: [0.0, 0.01631064745818844],
2: [0.007714143779717549, 0.029094220897499586],
3: [0.01163863458945426, 0.031892697466467966],
4: [0.02058760911219928, 0.0439311144229177],
5: [0.024135973316301188, 0.05161450571286639],
6: [0.03285785252998368, 0.06189766517635371],
7: [0.040919736001703226, 0.07544295413147871],
8: [0.04932226243701654, 0.08271236959761552],
9: [0.060747995174224675, 0.09908925318761384],
10: [0.06757504790291677, 0.10438814373240599],
11: [0.08152721595344545, 0.11813214108296077],
12: [0.08809878645944219, 0.1238615664845173],
13: [0.10246256475764673, 0.13967544295413148],
14: [0.10914058618976652, 0.13982447425070374]}
kltest = []
vltest = []
for k, v in knn_many_split2.items():
    kltest.append(k)
    vltest.append(v[1])
kltrain = []
vltrain = []
for k, v in knn_many_split2.items():
    kltrain.append(k)
    vltrain.append(v[0])
# Plotting K vs Error
plt.figure(figsize = (10, 6))
plt.plot(kltest, vltest, label = 'test' )
plt.plot(kltrain, vltrain, label = 'train')
plt.legend()
plt.show()
Observations: Both training and validation error rise steadily with K. The unusually low validation error at small K is likely optimistic, since SMOTE's synthetic samples lie close to their originals; K = 14 is used below as a more conservative choice.
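The repeated random splits above can also be replaced with k-fold cross-validation, which evaluates each candidate K on systematically held-out folds. A sketch with synthetic data standing in for the SMOTE training set:

```python
# Sketch: choose K by 5-fold cross-validated error instead of repeated
# random splits. make_classification stands in for X2_smosc / y2_smosc.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, random_state=42)
cv_error = {}
for k in range(1, 15):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    cv_error[k] = 1 - acc

best_k = min(cv_error, key=cv_error.get)
print(best_k, round(cv_error[best_k], 3))
```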
# Define K-NN model
knn = KNeighborsClassifier(n_neighbors = 14)
# Fitting data to the K-NN model
knn.fit(X2_smosc,y2_smosc)
KNeighborsClassifier(n_neighbors=14)
# Checking the performance of K-NN model on the training data
y_pred_train_knn = knn.predict(X2_smosc)
metrics_score(y2_smosc, y_pred_train_knn)
precision recall f1-score support
0 0.93 0.94 0.94 3355
1 0.94 0.93 0.94 3355
accuracy 0.94 6710
macro avg 0.94 0.94 0.94 6710
weighted avg 0.94 0.94 0.94 6710
# Checking the performance of K-NN model on the test data
y_pred_test_knn = knn.predict(X2_test_sc)
metrics_score(y2_test_sc, y_pred_test_knn)
precision recall f1-score support
0 0.93 0.92 0.92 1416
1 0.69 0.72 0.71 372
accuracy 0.88 1788
macro avg 0.81 0.82 0.81 1788
weighted avg 0.88 0.88 0.88 1788
Observations: The K = 14 K-NN model generalizes reasonably well, reaching 72% test recall and 69% precision for defaulters without severe overfitting.
# Tuning the KNN model
params_knn = {'n_neighbors': np.arange(3, 15), 'weights': ['uniform', 'distance'], 'p': [1, 2]}
grid_knn = GridSearchCV(estimator = knn, param_grid = params_knn, scoring = 'recall', cv = 10)
model_knn = grid_knn.fit(X2_smosc, y2_smosc)
knn_estimator = model_knn.best_estimator_
print(knn_estimator)
KNeighborsClassifier(n_neighbors=3, p=1, weights='distance')
# Fit the best estimator on the training data
knn_estimator.fit(X2_smosc, y2_smosc)
KNeighborsClassifier(n_neighbors=3, p=1, weights='distance')
y_pred_train_knn_estimator = knn_estimator.predict(X2_smosc)
metrics_score(y2_smosc, y_pred_train_knn_estimator)
precision recall f1-score support
0 1.00 1.00 1.00 3355
1 1.00 1.00 1.00 3355
accuracy 1.00 6710
macro avg 1.00 1.00 1.00 6710
weighted avg 1.00 1.00 1.00 6710
y_pred_test_knn_estimator = knn_estimator.predict(X2_test_sc)
metrics_score(y2_test_sc, y_pred_test_knn_estimator)
precision recall f1-score support
0 0.95 0.99 0.97 1416
1 0.97 0.80 0.88 372
accuracy 0.95 1788
macro avg 0.96 0.90 0.92 1788
weighted avg 0.95 0.95 0.95 1788
Observations: The tuned model fits the training data perfectly, as expected with distance weighting, yet also delivers the strongest test performance so far: 80% recall and 97% precision for defaulters.
# Function to calculate recall score to place in comparison table
def get_recall_score(model, flag=True, X_train=X2_smo, X_test=X2_test):
    '''
    model : classifier to predict values of X
    '''
    a = []  # list to store train and test results
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    train_recall = metrics.recall_score(y2_smo, pred_train)
    test_recall = metrics.recall_score(y2_test, pred_test)
    a.append(train_recall)  # adding train recall to the list
    a.append(test_recall)   # adding test recall to the list
    # Print statements are only displayed when flag is True (the default)
    if flag:
        print("Recall on training set : ", train_recall)
        print("Recall on test set : ", test_recall)
    return a  # returning the list with train and test scores
# Function to calculate precision score to place in comparison table
def get_precision_score(model, flag=True, X_train=X2_smo, X_test=X2_test):
    '''
    model : classifier to predict values of X
    '''
    b = []  # list to store train and test results
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    train_precision = metrics.precision_score(y2_smo, pred_train)
    test_precision = metrics.precision_score(y2_test, pred_test)
    b.append(train_precision)  # adding train precision to the list
    b.append(test_precision)   # adding test precision to the list
    # Print statements are only displayed when flag is True (the default)
    if flag:
        print("Precision on training set : ", train_precision)
        print("Precision on test set : ", test_precision)
    return b  # returning the list with train and test scores
## Function to calculate accuracy score
def get_accuracy_score(model, flag=True, X_train=X2_smo, X_test=X2_test):
    '''
    model : classifier to predict values of X
    '''
    c = []  # list to store train and test results
    train_acc = model.score(X_train, y2_smo)
    test_acc = model.score(X_test, y2_test)
    c.append(train_acc)  # adding train accuracy to the list
    c.append(test_acc)   # adding test accuracy to the list
    # Print statements are only displayed when flag is True (the default)
    if flag:
        print("Accuracy on training set : ", train_acc)
        print("Accuracy on test set : ", test_acc)
    return c  # returning the list with train and test scores
# Make the list of model names
models = [d_tree2_smo, d_tree_tuned, d_tree_tuned2, rf_estimator2_smo, rf_estimator_tuned, rf_estimator_tuned2]
# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
# Looping through all the models to get the accuracy, recall and precision scores
for model in models:
    # accuracy score
    j = get_accuracy_score(model, False)
    acc_train.append(j[0])
    acc_test.append(j[1])
    # recall score
    k = get_recall_score(model, False)
    recall_train.append(k[0])
    recall_test.append(k[1])
    # precision score
    l = get_precision_score(model, False)
    precision_train.append(l[0])
    precision_test.append(l[1])
# Make table of model comparisons
comparison_frame = pd.DataFrame({'Model':['Decision Tree 70/30', 'Dec Tree Tuned', 'Dec Tree Tuned 2', 'Random Forest 70/30', 'Ran Forest Tuned', 'Ran Forest Tuned 2'],
'Train_Accuracy': acc_train,
'Test_Accuracy': acc_test,
'Train_Recall': recall_train,
'Test_Recall': recall_test,
'Train_Precision': precision_train,
'Test_Precision': precision_test})
comparison_frame
| | Model | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision |
|---|---|---|---|---|---|---|---|
| 0 | Decision Tree 70/30 | 1.000000 | 0.865772 | 1.000000 | 0.639785 | 1.000000 | 0.691860 |
| 1 | Dec Tree Tuned | 0.903428 | 0.865213 | 0.871237 | 0.612903 | 0.931188 | 0.701538 |
| 2 | Dec Tree Tuned 2 | 0.845902 | 0.837808 | 0.833383 | 0.766129 | 0.854784 | 0.584016 |
| 3 | Random Forest 70/30 | 1.000000 | 0.937360 | 1.000000 | 0.768817 | 1.000000 | 0.916667 |
| 4 | Ran Forest Tuned | 0.910432 | 0.866890 | 0.881967 | 0.602151 | 0.935209 | 0.713376 |
| 5 | Ran Forest Tuned 2 | 0.882563 | 0.871924 | 0.853949 | 0.706989 | 0.905786 | 0.686684 |
# Functions to calculate different scores for scaled data
def get_recall_score2(model, flag=True, X_train=X2_smosc, X_test=X2_test_sc):
    '''
    model : classifier to predict values of X
    '''
    a = []  # list to store train and test results
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    train_recall = metrics.recall_score(y2_smosc, pred_train)
    test_recall = metrics.recall_score(y2_test_sc, pred_test)
    a.append(train_recall)  # adding train recall to the list
    a.append(test_recall)   # adding test recall to the list
    # Print statements are only displayed when flag is True (the default)
    if flag:
        print("Recall on training set : ", train_recall)
        print("Recall on test set : ", test_recall)
    return a  # returning the list with train and test scores
def get_precision_score2(model, flag=True, X_train=X2_smosc, X_test=X2_test_sc):
    '''
    model : classifier to predict values of X
    '''
    b = []  # list to store train and test results
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    train_precision = metrics.precision_score(y2_smosc, pred_train)
    test_precision = metrics.precision_score(y2_test_sc, pred_test)
    b.append(train_precision)  # adding train precision to the list
    b.append(test_precision)   # adding test precision to the list
    # Print statements are only displayed when flag is True (the default)
    if flag:
        print("Precision on training set : ", train_precision)
        print("Precision on test set : ", test_precision)
    return b  # returning the list with train and test scores
def get_accuracy_score2(model, flag=True, X_train=X2_smosc, X_test=X2_test_sc):
    '''
    model : classifier to predict values of X
    '''
    c = []  # list to store train and test results
    train_acc = model.score(X_train, y2_smosc)
    test_acc = model.score(X_test, y2_test_sc)
    c.append(train_acc)  # adding train accuracy to the list
    c.append(test_acc)   # adding test accuracy to the list
    # Print statements are only displayed when flag is True (the default)
    if flag:
        print("Accuracy on training set : ", train_acc)
        print("Accuracy on test set : ", test_acc)
    return c  # returning the list with train and test scores
# Make the list of all the model names for 75/25 splits
models = [lr2_smo, lrl2_smo, knn, knn_estimator]
# defining empty lists to add train and test results
acc_train2 = []
acc_test2 = []
recall_train2 = []
recall_test2 = []
precision_train2 = []
precision_test2 = []
# Looping through all the models to get the accuracy, recall and precision scores
for model in models:
    # accuracy score
    j = get_accuracy_score2(model, False)
    acc_train2.append(j[0])
    acc_test2.append(j[1])
    # recall score
    k = get_recall_score2(model, False)
    recall_train2.append(k[0])
    recall_test2.append(k[1])
    # precision score
    l = get_precision_score2(model, False)
    precision_train2.append(l[0])
    precision_test2.append(l[1])
comparison_frame2 = pd.DataFrame({'Model':['Logistic Regression','Logistic Regression with Lasso', 'KNN', 'KNN Tuned'],
'Train_Accuracy': acc_train2,
'Test_Accuracy': acc_test2,
'Train_Recall': recall_train2,
'Test_Recall': recall_test2,
'Train_Precision': precision_train2,
'Test_Precision': precision_test2})
comparison_frame2
| | Model | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision |
|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.717288 | 0.746085 | 0.671237 | 0.672043 | 0.739330 | 0.429553 |
| 1 | Logistic Regression with Lasso | 0.717437 | 0.745526 | 0.671237 | 0.672043 | 0.739573 | 0.428816 |
| 2 | KNN | 0.935171 | 0.875280 | 0.934724 | 0.720430 | 0.935561 | 0.692506 |
| 3 | KNN Tuned | 1.000000 | 0.953020 | 1.000000 | 0.801075 | 1.000000 | 0.967532 |
# Adding results dataframes together
frames = [comparison_frame, comparison_frame2]
comb_result = pd.concat(frames)
comb_result.reset_index()
| | index | Model | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Decision Tree 70/30 | 1.000000 | 0.865772 | 1.000000 | 0.639785 | 1.000000 | 0.691860 |
| 1 | 1 | Dec Tree Tuned | 0.903428 | 0.865213 | 0.871237 | 0.612903 | 0.931188 | 0.701538 |
| 2 | 2 | Dec Tree Tuned 2 | 0.845902 | 0.837808 | 0.833383 | 0.766129 | 0.854784 | 0.584016 |
| 3 | 3 | Random Forest 70/30 | 1.000000 | 0.937360 | 1.000000 | 0.768817 | 1.000000 | 0.916667 |
| 4 | 4 | Ran Forest Tuned | 0.910432 | 0.866890 | 0.881967 | 0.602151 | 0.935209 | 0.713376 |
| 5 | 5 | Ran Forest Tuned 2 | 0.882563 | 0.871924 | 0.853949 | 0.706989 | 0.905786 | 0.686684 |
| 6 | 0 | Logistic Regression | 0.717288 | 0.746085 | 0.671237 | 0.672043 | 0.739330 | 0.429553 |
| 7 | 1 | Logistic Regression with Lasso | 0.717437 | 0.745526 | 0.671237 | 0.672043 | 0.739573 | 0.428816 |
| 8 | 2 | KNN | 0.935171 | 0.875280 | 0.934724 | 0.720430 | 0.935561 | 0.692506 |
| 9 | 3 | KNN Tuned | 1.000000 | 0.953020 | 1.000000 | 0.801075 | 1.000000 | 0.967532 |
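With all results in one frame, sorting by the metric that matters most to the bank (test recall, since missed defaults are costly) surfaces the leading candidate. A sketch using three rows copied from the table above:

```python
# Sketch: rank models by test recall; values copied from the combined
# results table above (three representative rows only).
import pandas as pd

comb = pd.DataFrame({
    'Model': ['Random Forest 70/30', 'KNN Tuned', 'Dec Tree Tuned 2'],
    'Test_Recall': [0.768817, 0.801075, 0.766129],
    'Test_Precision': [0.916667, 0.967532, 0.584016],
})
best = comb.sort_values('Test_Recall', ascending=False).iloc[0]['Model']
print(best)  # KNN Tuned
```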
Data Exploration
Data Treatment
Model Comparison
Improving Performance
Current Model to be Adopted
Importance of different features